FOREWORD



Measurement and testing are central to the educational process and have been with
us for many years. The modern era of measurement began in the 1920's and has not
changed substantially since that time. This Research Memo serves as a vehicle to
communicate NEA's past and present position on standardized testing. Testing
resolutions adopted by the NEA Representative Assembly in July 1980 and after
the preparation of the body of this report appear in Appendix E. In addition,
current testing issues, such as coachability of aptitude tests and truth-in-testing
legislation, are reported.

With the computer assistance of Susan Falsey, Chet McCall and I authored the
chapter on the results of the multivariate analysis on the coachability of the SATs.
Larry Robinson completed the state survey of commercial involvement in statewide
testing programs.
Introduction



This Research Memo is for test users, that is, for teachers, administrators, counselors,
curriculum specialists, school board members, and legislators. Others who
may be interested in this report are people who work in elementary and secondary
schools, institutions of higher education, and programs for adult learners, and who
work with children, teenagers, or adults. Above all, people who use tests as a means
for improving education should find this Memo of value.

The purpose of this report is to provide general background information about
measurement and testing. The information includes discussions of the meaning
attributed to educational measurements, the language of testing, guidelines for test
selection, and the uses of test data. A number of problems and issues associated with
tests and testing practices are also discussed. This background is not a substitute for
excellent textbooks in measurement and testing philosophy, principles of testing,
and test evaluation, nor is it intended to guide test development or direct the
evaluation of testing programs. Instead, the information, together with continued
inquiry into tests and testing practices, can help promote informed and responsible
use of tests and test data. In this way the Memo can help people use test results as an
aid in improving education.

There are five sections in this report:

    I.   A historical examination of testing and the developments in psychology
         and mathematics that helped shape contemporary testing practices, and a
         review of testing practices.

    II.  The issue of coaching for college admissions examinations, including data
         supporting the hypothesis that standardized tests designed to measure
         aptitude are coachable.

    III. Results of an NEA review of commercial involvement in statewide
         testing programs.

    IV.  A review of the NEA position on testing, beginning with the 1972
         moratorium and including policy revisions and current policy concerning
         students, testing, and the instructional process.

    V.   Truth-in-testing legislation, including state and federal laws; arguments
         supporting testing legislation; arguments defending current testing
         practices; and the NEA position on truth-in-testing legislation, full
         disclosure, and open testing.


The five sections treat the subjects of measurement and testing broadly. Readers
who seek greater detail and who wish to pursue a specific subject on their own may
find useful the recommended reading list concluding most sections.
SECTION I: HISTORICAL AND TECHNICAL BACKGROUND



Historical Background 




Measurement in education has a long history.1 As early as 4000 B.C., written
examinations were part of the Chinese civil service system and were used to measure
how much civil servants had learned. The ancient Spartans had an elaborate series
of tests designed to measure mastery of skills of manhood, and in Athens Socrates
refined a kind of measure designed to enrich and extend the learning of pupils.
During the Middle Ages, the University of Paris introduced the oral examination
for master's degree candidates. The practice spread throughout European universities
and was extended in 1787 by Frederick William II of Prussia to include
secondary students seeking university admission.


In the United States, measurement appeared almost as soon as the first schools were
built. A variety of measures were used to assess student achievement, and the oral
examination was one of the most popular. It was a quick and easy measure to use,
inexpensive, and provided lots of information. It was also controversial. Some
students and teachers complained that the exams were unfair and used for punitive
reasons. Horace Mann thought so, too; and he successfully argued that written
rather than oral exams would provide more objective information.


As measurement is known today, however, it is barely 60 years old. It has during
that time become a formal and systematic process complete with theory, a special
language, a set of traditions, and what W. James Popham has called a "well
established set of expectations."2 The expectations are that important and sophisticated
test development is the task of measurement specialists and that the primary
purpose of these tests is to detect diversity among students relative to some
measured ability. According to Popham, the expectations are not exactly wrong,
for they make sense in light of the history of modern testing and the way tests have
been used in the past. Nor are the expectations necessarily right, for they represent a
narrow and limiting view of what measurement means, how tests can be used, and
the role both measurement and tests can play in education.

Current expectations of educational measurement have a history. If understood,
this history can illuminate the present and can give some hint of future change.
Therefore, a discussion of current measurement and testing begins properly with
the past.

During the nineteenth century, two trends began to shape contemporary educational
measurement. The first trend in physiology emerged as a group of European
scientists began studying human behavior. In 1811, Sir Charles Bell and later
François Magendie discovered anatomical and functional differences between
sensory and motor nerves. This discovery separated nerve physiology into the study
of sensation and movement and was followed by a great deal of work with
sensation. Some scientists began studying reflex action and what astronomers
called the "personal equation," or what is now called reaction time.
The century began with wide acceptance of Immanuel Kant's assertion that psychology
could not be experimental. By mid-century some scientists cautiously
speculated that the mind might be studied empirically. At this point psychology
broke from its affiliation with philosophy and physiology and became known as the 
new field of experimental psychology. 

Among the interests of early experimental psychologists were the ideas of individ- 
ual differences and complex mental life and the hypothesis that complex mental life 
was comprised of a combination of sensory experiences. These ideas had intrigued 
philosophers, but psychologists tried to measure them. They adapted to their 
purposes known methods of scientific inquiry and in time established psychological 
laboratories of which two gained particular prominence: the Wundt laboratory in
Germany and the Anthropometric Laboratory established by Sir Francis Galton in 
England. 



Wilhelm Wundt and his associates in Germany studied primarily human sensitivity
to sensory stimuli and reaction time, and they searched for uniformity in human
behavior. Their legacy to contemporary measurement, however, was methodology.
They emphasized rigorous control of experimental conditions and demanded
accuracy, precision, order, and the reproducibility of research results, all of which
laid the foundation for contemporary measurement standards for objectivity,
reliability, and validity.

Sir Francis Galton's initial interest began with the inheritance of genius and led to
the measurement of human faculties. Galton is credited with inventing the test as an
experimental method; and he and his associates developed a number of tests for
studying individuals and, in particular, individual differences.

Thus, the first trend in physiology began to shape into scientific form the study of
human behavior. Among the contributions of these early scientists were a theoretical
approach to the study of human behavior, the concepts of human attributes
whereby individual similarities and differences could be studied, an array of instruments
for recording human behavior, and a rigorous methodology.


While physiologists were concerned with sensation and movement, a second trend
that influenced contemporary measurement emerged in mathematics. There was
during the early nineteenth century considerable interest in observational and
instrumental error. The interest was pronounced in astronomy where perfect
observations were essential for calibration of the clock. During the 1820s, Friedrich
Wilhelm Bessel investigated observational error and discovered personal errors of
observation among astronomers observing the same event and by an individual
observing across time. Bessel presented the observer differences as a mathematical
equation, and efforts were made by astronomers to determine these "personal
equations" and to correct for them.

In 1733, Abraham de Moivre formulated a mathematical theory of error called the
theory of probability; Pierre Simon Laplace and Karl Friedrich Gauss demonstrated
its usefulness as a mathematical tool early during the nineteenth century;
and in 1846, Lambert Quetelet applied it successfully to the measurement of human
attributes. This application led to what is known as the normal curve. It is a
mathematical model representing the expected distribution of some variable when
an infinite number of observations is made. Inspired by this success, Sir Francis
Galton, who was studying individual differences and experimenting with tests of
mental ability, began to explore ways of applying mathematics to human measurement.
One of the concepts he worked out was statistical correlation. His contemporary,
Karl Pearson, derived the mathematical formulation for the concept.
By the turn of the century, a number of psychological concepts and mathematical
formulations had been developed in Europe for the study of human behavior. When
James McKeen Cattell returned to the United States after doctoral study in Germany, he
brought with him many of these concepts and tools. Cattell established psychological
laboratories at the University of Pennsylvania and Columbia where he stressed
experimental rigor. He was also interested in Galton's work with individual differences
and testing. He began developing what he called mental tests and eventually
administered them to entering freshmen at Pennsylvania and Columbia. The tests
were of various sensory attributes such as reaction time and visual acuity.

For decades psychologists had approached mental measurement and complex
mental life through the study of simple sensory experiences. This was the theoretical
orientation of Cattell, and it was the approach challenged by Alfred Binet. Binet
believed complex mental ability could be measured directly and that the concept
itself could be reduced to a number of specific abilities. Binet and Theophile Simon
set out to measure complex ability directly and for that purpose developed a scale
which they revised and refined over several years. Binet then constructed a formal
numerical base for translating test performance into mathematical language. Thus,
he succeeded in developing a measure of the characteristic he called intelligence in
such a way that the characteristic could be tested.

Binet's work had profound significance for educational measurement, but it alone
was not responsible for the "testing boom" that occurred later. The work of Binet
and others in France had been carefully followed by psychologists in the United
States. When World War I broke out, the military needed to assemble a huge army
quickly and wanted some objective way to classify and place new recruits. Psychologists
reasoned that if they could adapt Binet's tests of individuals to groups, they
could provide that objective procedure. Their efforts resulted in group tests of
mental ability called Army Alpha and Army Beta. The project demonstrated the
feasibility of testing many people quickly and simultaneously and was regarded at
that time as enormously successful.

After the war, the idea of group tests took hold. A number of tests were constructed,
and most were fashioned after the Army Alpha. Most, like Army Alpha, focused on
some mental attribute, were of the paper-and-pencil type, and were designed to
differentiate among people. Regarded as technically complex and scientific instruments,
the tests were also subject to special protection. The belief was that if tests
were available to untrained and irresponsible people, they would be misused and
spoiled.

Few people questioned publicly the assumptions underlying the new tests, although
the tests themselves and more often their use were debated publicly.3 Nevertheless,
many tests were constructed and eventually used in the schools. The schools
provided a logical setting for the new tests. A tradition of testing already existed;
educators showed interest in the new measures; and the tests provided an attractive
alternative to the methods then in use. Mass production made many tests available
at reasonable cost. Electronic scoring and computer processing made possible the
analysis of data with unprecedented accuracy and speed. General public support of
the testing movement and, eventually, publicly issued testing mandates all helped
make testing a common educational practice.

Technical Background: The Meaning of Measurement


In one sense, theory is a symbolic representation of experience.4 It is a way of
making sense of experience. With theory the logic of experience is reconstructed so
that experience can be contemplated, interpreted, criticized, and unified. Theories
are modified as new and unanticipated data are found. They are discarded when
they are no longer consistent with the data. New theories take the place of discarded
theories and help continue the discovery of generalizations about experience. Thus,
theory guides inquiry; it helps explain the present; and it also has the quality of
tentativeness.

The meaning of measurement is provided by theory, and in education prevalent
theories concern cognition, learning, and instructional practice. For this reason, the
measurement of mental traits has many implications for teaching and learning.

A mental trait is a hypothesized attribute of people. G.C. Helmstadter describes a
trait as an abstract attribute.5 It is not concrete in the sense that it can be known
through the senses. One such trait is ability. It cannot be directly seen, heard,
touched, or smelled. The trait cannot be known directly at the concrete level of
experience. It is an abstraction and is postulated theoretically as an attribute of
people. If ability is an attribute important to the learning process, then its measurement
would provide information useful to instruction. The problem is how to
measure something that has no concrete form.

In theory the trait is presumed to exist and to manifest itself in certain forms of
observable human behavior. The task becomes one of identifying those behaviors
assumed to reflect the trait, defining the trait so that it can be measured, and
constructing an instrument powerful enough to assess the behavior of interest. A
test, then, can be thought of as a way of obtaining examples of human behavior. The
examples are given a numerical value assumed to resemble the measure of the real
underlying trait. Hence, a measurement has been made.

Obviously, measurement is not an end in itself. Its scientific value can be best
appreciated as an instrument leading to action. With the assumption that the theory
guiding measurement is appropriate, the meaning of measurement is derived from
the ends it is intended to serve, the role it is called upon to play, and the functions it
performs in inquiry.



One function of measurement is to refine descriptions and make them more precise.
Using numbers to represent traits and their properties allows minute distinctions to
be made between observed similarities and differences. With precision, classification
systems emerge and ambiguity dissolves to the extent that knowledge permits.
Precision does not mean that disagreements are impossible. Disagreements continue
to exist; but they are sharpened, at times refocused, sometimes dissolved.

A second function of measurement is its practical utility. The meaning of measure- 
ment is often associated with the way it is used and the ends it helps achieve. Since 
tests in education are functionally viewed as providing information for decision 
making, the actual functions of testing are referred to in terms of the kinds of
decisions they serve. 

Lee J. Cronbach uses a three-category system to talk about the practical function of
tests in terms of their decision-making function.6 The categories are research;
evaluation; and selection, classification, and placement decisions.




In research an investigator may be interested in the hypothesis that no relationship 
exists between coaching as a form of preparation to take a certain test and student
performance on that test. The investigator may use that test in an experiment 
designed to test the hypothesis. Use of the test would help the researcher decide 
whether to accept or reject the hypothesis. 
Evaluation is a second kind of situation for which tests are used. Here various
measures of the results of a specific training or educational program are obtained
and provided to various audiences. The purpose of the measures is to provide
evidence in the process of judging the worth or merit of the program. If, for
example, a curriculum committee is evaluating a reading program, test scores may
be used as one kind of information to help committee members judge the program's
worth.

A third kind of situation involves selection, classification, and placement. In 
selection decisions, some individuals are chosen by preference from among others. 
Selection implies rejection as some people are chosen and others are not. Admitting
students to law school is a selection decision for which tests are used. 

Classification decisions involve a systematic arrangement of individuals for prospective
treatment. On the basis of a music performance test, for example, students
might be classified as beginning, intermediate, or advanced students. Based on this
classification, students receive different instructional treatment. 

Placement is a special kind of classification. Like classification, placement implies
different treatment for different people. Unlike classification where differential
treatment is temporary, placement involves relatively long-term differential treatment.
One such placement decision occurs when individuals are classified on the
basis of some measured trait and then placed in differing instructional programs.
The programs often continue for the duration of a student's elementary or secondary
school experience, and there is little student movement from one program to
another.

Another way of classifying educational decisions focuses on the decision maker and
on the power implied by the act of making a decision. From this perspective,
decisions can be classified as institutional or individual. Institutional decisions
serve institutional needs. They are made in a centralized setting, generally involve
policies and guidelines, and may involve categories of students with identical or
similar characteristics. A college admissions committee, for example, may make
admission decisions about several thousand applicants and may in the process use
test data to help identify preferred students. This would be considered an institutional
decision.

Individual decisions serve individual rather than institutional needs or preferences.
Individual decisions typically take place in decentralized settings. The individual
may use information from tests and from other people. The decision is a matter of
individual choice, and the power implied in decision making rests with the
individual.

There are other ways of classifying the functional use of testing, but the two systems
described above are common. The systems focus on the decision-making function.
Regardless of the kind of decision made, Lee J. Cronbach argues that all decisions
involve prediction. According to Cronbach, a test might provide interesting information
about individual differences, but the fact might not be worth knowing if one
could not predict that these same individuals will differ at some future moment with
respect to the same or some other measure. For example, committee members
gather information to evaluate a reading program. They examine the evidence and
determine that the program is effective and should be continued. Implicit in the
decision is the prediction that program effectiveness will continue given the same or
similar circumstances. The university requires a test score as part of the application
for admission procedure. The test score may be used as the basis for predicting
whether students will successfully complete their first year in college. The teacher
administers a music performance test. The prediction here is that certain skill levels
can be best developed by certain instructional treatment.

The predictive aspect of decision making, of testing, and of the way traits are
conceptualized is important. Decisions are made with the expectation that certain
desired outcomes are likely to occur. Tests are administered with the expectation
that resulting information will improve the predictive dimension of decision making.
From the practical point of view, prediction is a function of measurement, and
the predictive power of a test is an important characteristic.
Technical Background: The Language of Testing

The term prediction is one of many words associated with the language of testing. It
is associated with one of the more technical aspects of testing to be discussed on the
following pages. First, it might be useful to consider the way tests are classified and
then to discuss technical characteristics and standards for evaluating tests.

Tests are classified generally along one of two dimensions: the way traits are
measured and the kind of trait measured. Consider first the way traits are measured.
Common descriptors are objective and subjective tests, standardized tests, norm-referenced
and criterion-referenced tests. The meaning of many of these descriptors is
obvious, but some discussion might be appropriate. Group tests are administered to
many individuals simultaneously, but they can be given to single individuals if
necessary. Individual tests usually require the manipulation of apparatus and
careful questioning or observation and must be administered to one student at a
time. All tests can be regarded as requiring some kind of performance, but a
performance test usually refers to a task requiring no verbal response.

The terms objective and subjective refer to scoring procedures. A test is subjective if
scoring involves judgment on the part of the person doing the scoring. An oral
examination is considered quite subjective as are many essay tests. A test is said to
be objective if the scoring can be replicated exactly and the same score can be
derived regardless of who scores the test and when. A multiple-choice test where
each item has only one agreed-upon correct answer can be quite objective. Even
with objective tests, however, there is such a thing as scoring error.

Standardized tests are those in which procedures, test materials, and scoring are
fixed precisely so that duplication is possible at varying times and places. Standardization
was one of the early procedures developed for testing. Without it, results
from different experimental laboratories could not be compared. Later standardized
classroom tests enabled people to compare test results across classrooms,
schools, and regions.

Norm-referenced and criterion-referenced tests involve different standards for
interpretation. A criterion-referenced test is one designed to describe a person's
score or level of performance in terms of the kind of knowledge, skill, or task he or
she can accomplish. A norm-referenced test is one designed to describe performance
in terms of a person's relative standing among others who have taken the
same test.
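
The two standards for interpretation can be contrasted with a short sketch. It is only an illustration under stated assumptions: the norm-group scores, the student's score, and the mastery cutoff of 24 are hypothetical values, and the function names are not taken from any testing program.

```python
def percentile_rank(score, norm_scores):
    """Norm-referenced reading: percent of the norm group at or below `score`."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


def meets_criterion(score, cutoff):
    """Criterion-referenced reading: has the fixed performance standard been met?"""
    return score >= cutoff


# Hypothetical raw scores of ten people who took the same test.
norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 28, 30]
student_score = 22

rank = percentile_rank(student_score, norm_group)   # standing relative to others
mastered = meets_criterion(student_score, cutoff=24)  # cutoff is an assumed standard

print(rank, mastered)
```

The same raw score yields two different statements: one about where the student stands among others, the other about whether a stated level of performance has been reached.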

Tests can also be classified according to the trait measured. The broadest classification
distinguishes between maximum performance and typical performance. Maximum
performance tests are designed to measure a person's best performance.
Measurement assumptions are that the test actually brings out best performance
and that the subject is motivated to earn the best possible score. The following tests
are traditionally considered measures of maximum performance:

• Intelligence is sometimes defined as what the test measures, and intelligence
tests generally measure verbal, nonverbal, memory, and problem-solving
skills. In this sense intelligence tests are often considered
measures of general or scholastic aptitude.

• Aptitude tests theoretically measure mental operations that improve
little with practice. They provide the basis for predicting levels of future
performance.

• Ability tests theoretically measure functions that reflect both innate
ability and the influence of general environmental enrichment.

• Achievement tests are designed to measure skills, knowledge, and degree
of accomplishment or competence acquired through some educational or
training experience.



Tests of typical performance are designed to measure how a person reacts, feels, or
behaves. Presumably the test has the power to solicit a sample of characteristic
behavior, and the subject is encouraged to demonstrate such behavior. Personality
tests, interest inventories, and projective techniques are all examples of typical
performance tests.

Tests have a number of other characteristics. They are composed of one or a
number of structured items or exercises. The item provides a performance stimulus
and structures the response. For purposes of analysis, an item is the basic scorable
unit of a test.

A test is composed of a number of items which the individual attempts to answer.
Each answer is classified according to some numerical scale, usually a two-category
scale of right (= 1) or wrong (= 0). These numbers are called item scores. Item scores
are then summed for a given test to yield a raw score which is the total number of
items right.
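
The scoring rule just described can be written out as a minimal sketch. The five-item answer key and the responses are hypothetical, and the function names are chosen only for this illustration.

```python
def score_items(responses, answer_key):
    """Return one item score per item: 1 for a right answer, 0 for a wrong one."""
    return [1 if given == correct else 0
            for given, correct in zip(responses, answer_key)]


def raw_score(item_scores):
    """The raw score is the sum of item scores, i.e. the number of items right."""
    return sum(item_scores)


# Hypothetical five-item multiple-choice test.
answer_key = ["B", "D", "A", "C", "A"]
responses = ["B", "D", "C", "C", "A"]

item_scores = score_items(responses, answer_key)
total = raw_score(item_scores)

print(item_scores, total)
```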

When a test is given to a number of people, their raw scores can be tallied and
described in various ways. Commonly the scores are described by a frequency
distribution (the number of people who obtained each score), the cumulative frequency
distribution (the number of people who obtained a given score or lower), or such
summary statistics as the average (arithmetic mean) and some measure of variability
(the range or standard deviation).
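
These descriptions of a set of raw scores can be computed with the Python standard library, as in the sketch below. The ten raw scores are hypothetical, and the population standard deviation is used as one possible choice of variability measure.

```python
from collections import Counter
from itertools import accumulate
from statistics import mean, pstdev

# Hypothetical raw scores of ten people on a five-item test.
raw_scores = [3, 5, 4, 5, 2, 4, 5, 3, 4, 5]

# Frequency distribution: number of people who obtained each score.
frequency = Counter(raw_scores)

# Cumulative frequency: number of people at a given score or lower.
ordered = sorted(frequency)
counts = [frequency[s] for s in ordered]
cumulative = dict(zip(ordered, accumulate(counts)))

# Summary statistics: arithmetic mean, range, and standard deviation.
average = mean(raw_scores)
score_range = max(raw_scores) - min(raw_scores)
spread = pstdev(raw_scores)  # population form; an assumption of this sketch

print(dict(frequency), cumulative, average, score_range, spread)
```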

The distribution of test scores is influenced primarily by two item characteristics
called item difficulty and homogeneity. The difficulty of an item refers to the
proportion of students in a given population or sample who got the item right.
Thus, an item with a p value of .90 means that 90 percent of the students who
answered that item gave the right answer. A p value of .90 suggests a fairly easy item.
A p value of .25 suggests a fairly difficult item. Item difficulty affects the mean score.
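
Item difficulty, and its link to the mean score, can be shown with a small sketch. The matrix of right/wrong item scores is hypothetical. With 0/1 scoring the mean raw score equals the sum of the item p values, which is one way to see why item difficulty affects the mean.

```python
def item_difficulty(score_matrix):
    """Proportion of students answering each item right (the item's p value).

    `score_matrix` holds one row per student and one 0/1 item score per item.
    """
    n_students = len(score_matrix)
    n_items = len(score_matrix[0])
    return [sum(row[i] for row in score_matrix) / n_students
            for i in range(n_items)]


# Hypothetical item scores: four students, three items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

p_values = item_difficulty(scores)
mean_raw = sum(sum(row) for row in scores) / len(scores)

# The mean raw score is the sum of the p values when items are scored 0/1.
print(p_values, mean_raw, sum(p_values))
```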

The homogeneity of an item refers to its correlation with other items in the test. An
item may have correlations ranging from a perfect negative correlation with
another item (-1) through no correlation at all (0) to a perfect positive correlation
(+1). The average correlation between all possible pairs of test items is called a
measure of homogeneity. Both item intercorrelations and item difficulty affect
variability.
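
As a rough sketch of the homogeneity measure, the code below averages the Pearson correlation over all possible pairs of items. The item scores are hypothetical, and the sketch assumes every item has some right and some wrong answers, since the correlation is undefined for an item with no variation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)  # undefined if either list has no variation


def homogeneity(score_matrix):
    """Average correlation over all possible pairs of items."""
    items = list(zip(*score_matrix))  # one tuple of scores per item
    pairs = list(combinations(items, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical item scores: four students, three items.
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

h = homogeneity(scores)
print(h)
```

In this made-up data the first two items agree perfectly and the third is unrelated to either, so the average falls well below +1.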




The distribution of test scores is affected by errors of measurement, and measurement
errors are detected through their effect on item difficulty and the item
intercorrelations. There are errors in any measurement, and test developers try to
reduce them. But errors exist, and they keep a test from rendering perfectly valid
results.

Error of measurement is an aspect of a technical characteristic called validity. 
Validity means truthfulness. To be valid, a test must measure accurately what it is
supposed to measure. Validity is the single most important characteristic of a test. 

There are three types of validity. Content validity is the degree to which test item
content explicitly matches the purpose for which the test is to be used. Content
validity usually involves a logical analysis of what the test contains. A group of
experts may examine test content and agree that the content samples the domain of
knowledge or skill in question. Teachers may examine test content to determine the
match between test content, instructional content, and instructional objectives.



A second type of validity is construct validity. Construct validity involves gathering
evidence to demonstrate that the theoretical trait measured by the test can be
verified experimentally. If two tests measure the same trait, then student performance
on the two tests should be more highly correlated than with performance on
a third test designed to measure a different trait. A study of construct validity would
demonstrate whether or not this is true.
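
The comparison described in this paragraph can be sketched in a few lines. The three sets of scores are hypothetical (two reading tests and one music test given to the same five students), and a real construct validity study would involve far more data and evidence than a single comparison of correlations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for the same five students on three tests.
reading_a = [10, 12, 14, 16, 18]   # first test of the trait
reading_b = [11, 13, 14, 17, 19]   # second test of the same trait
music = [15, 9, 14, 10, 13]        # test of a different trait

r_same = pearson(reading_a, reading_b)
r_diff = pearson(reading_a, music)

# Evidence consistent with the construct: the same-trait correlation is higher.
supports_construct = r_same > r_diff
print(round(r_same, 2), round(r_diff, 2), supports_construct)
```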

The third type of validity is criterion-related validity (either concurrent or predictive).
Criterion-related validity is the degree of accuracy with which a test score can
predict a person's performance on some criterion such as performance on another
test. If both the predictor and the criterion measures are gathered at the same time,
the study is said to be concurrent. If the measures are gathered at different times,
then the study is said to be predictive. In both concurrent and predictive validation 



consistency of a measure over time. Reliability is the extent to which a test is free of errors
when a person is measured more than once by the same instrument or when one
administration of a test yields small errors from one test taker to another. A test
with high reliability is one that will yield much the same score results for individuals
and a group of people under different conditions or situations.



People interested in learning about the technical aspects of educational
measurement will find many available materials ranging from introductory texts to
technical analyses of testing issues and problems. The readings recommended at the
conclusion of this section were selected to represent that range of available
materials.



Standards for Test Evaluation 

Thousands of tests on the market are available for school use. Their selection
should be based on a thorough understanding of the educational purpose of the test
and test quality. Among the many considerations involved in test selection,
standards of practicality, technical characteristics, and cost-benefit are of primary
importance.
Considerations of practicality involve planning and what some people call common
sense. The following questions are among those useful for determining the
practicality of a test:

Who will be tested?
How will individuals be tested, when, and where?
Who will administer the test?
Are test procedures feasible? (Consider available space and time,
qualifications required of test administrators, ease of administration and
scoring, and characteristics of the people being tested.)
What decision-making purpose will the test data serve?
Who will make the decision?
What are the information needs of the decision maker?
Will the test provide data relevant to the decision needs?
Will the test provide data important to the decision purpose?
Will use of the test provide timely data for the decision purpose?
Who will be affected by the decision?
In what way will individuals or groups be affected?



Practical 
considerations 



The following questions are useful for determining the technical characteristics of a
test:

Standardization

How is the test administered?
How is the sample selected for the norming population?
Who was included in the norming population?
What are the limitations of the derived scores?

Objectivity

What method of scoring is used?
Is the scoring system free of error?

Reliability

How is reliability determined?
What is the estimate of reliability for the test?
What is the standard error of measurement for the test?

Validity

Does the test have validity for the situation in which it is being used?
Does the test measure the information and/or performance on an
important set of tasks?
Does the test measure current performance when compared to a standard
or criterion measure?
Does the test measure future performance when compared to a standard
or criterion measure?
Does the test measure a trait or set of characteristics?
Can an experimental condition be created to test the hypothesis?
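
As a rough illustration of how the reliability questions above connect, the standard error of measurement is conventionally estimated in measurement textbooks as the score standard deviation times the square root of one minus the reliability estimate. That formula is standard psychometrics rather than something stated in this Memo, and the values below (a standard deviation of 100 and a reliability of 0.91) are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Textbook estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


# Hypothetical test: score SD of 100 points, reliability estimate of 0.91.
sem = standard_error_of_measurement(100, 0.91)

# A band of one SEM on either side of an observed score of 500.
observed = 500
band = (observed - sem, observed + sem)
print(round(sem, 1), tuple(round(b, 1) for b in band))
```

The lower the reliability estimate reported in a test manual, the wider this band becomes, which is why the two questions are usually asked together.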



Technical 
considerations 
Cost-benefit
consideration



Questions useful for determining cost-benefit include the following considerations:

What is the material cost of measuring each student?
What are the service costs involved in testing (e.g., computer scoring,
interpretive manuals)?
What are the personnel costs involved in test administration?
Can the test be locally scored?
How much useful information will the test provide?
Can other tests provide the same or better information and at what cost?
Do budget allocations for testing allow purchase of this test?
Is the test reusable?
Can the test serve multiple decision purposes?
Will the test provide quality information?



FOOTNOTES 
SECTION II: THE ISSUE OF SAT COACHABILITY

Introduction

The Scholastic Aptitude Test (SAT), sponsored by the College Examination Board
(CEB) 1 and produced by Educational Testing Service (ETS), is one of the most
prominent aptitude tests currently available. With a history spanning more than 50
years, the SAT is an educational measure familiar to millions of people who in the
past have taken the test or have used the test results for research, evaluation, and/or
selection, placement, or classification decisions. The test has been administered to
junior and senior high school students and has been used by colleges as part of the
college admission process. Thus, the test is not only familiar but also important to
students who seek college entry and to college officials who grant entry privileges.

The SAT has also gained prominence because of the claims made for its power. As a
test of aptitude, the SAT is alleged to be a measure of the capacity to learn and is
believed, therefore, to be impervious to special preparation or coaching. As a test of
scholastic aptitude, the SAT has also been promoted as an indicator of success in
college and in particular of college grades earned by students during the first year.

ETS and CEB have repeatedly claimed that the SAT is not coachable and that it
functions as a measure of college success. They have promoted these claims in their
publications and have cited various studies in support of their position. In turn, the
claims have been challenged, most recently by Warner Slack and Douglas Porter in
an article entitled "The Scholastic Aptitude Test: A Critical Appraisal" published
by the Harvard Educational Review. In this article the authors examine ETS and
CEB claims made for the SAT in light of available data.

Coaching issue

Slack and Porter begin their appraisal with the 1968 CEB position paper entitled
"Effects of Coaching on Scholastic Aptitude Test Scores." In this paper, the stated
position is that intensive drill or special tutoring will not significantly increase SAT
performance. 2 In support of this position, seven studies are referred to in which the
SAT was administered to students before and after some form of coaching.

Slack and Porter examine the seven studies noted by CEB. Contrary to the CEB
contention, the authors argue that the seven studies do provide evidence that
coaching for the SAT can lead to statistically significant score changes. 3 They
speculate that the best available coaching methods were not used in the studies cited
by CEB and proceed to examine other studies in which well-planned coaching
programs for the SAT resulted in score gains exceeding any gain that could be
expected from practice and growth alone. 4

The authors also examine studies published prior to the 1968 CEB publication on
coaching. 5 Of the 29 studies found, 23 were either conducted or sponsored by ETS
or CEB. The remaining 6 studies were the result of independent effort. Of the 23
ETS or CEB studies, 15 had been cited by ETS prior to the 1968 publication on
coaching. The weighted mean score gain for these 15 studies was 16. The weighted
mean score gain of studies not reported by ETS was 41. The weighted mean score
gain of all studies found, including those cited by ETS, was 29. The authors
High school record
is best predictor



conclude that research evidence does not support the claim that the SAT is
uncoachable. Furthermore, the authors argue that this evidence was available prior
to the 1968 CEB publication on coaching. 6
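
The weighted mean score gains reported above pool the results of several studies. The Memo does not state what weights Slack and Porter used; the sketch below assumes, as is common, that each study is weighted by its number of students. The study sizes and gains are hypothetical and are not the figures from the 29 studies.

```python
def weighted_mean_gain(studies):
    """Mean score gain across studies, weighting each study by its size.

    `studies` is a list of (number_of_students, mean_gain) pairs.
    """
    total_students = sum(n for n, _ in studies)
    return sum(n * gain for n, gain in studies) / total_students


# Hypothetical studies: (students, mean gain in SAT points).
hypothetical_studies = [(200, 10), (50, 40), (150, 30)]

weighted = weighted_mean_gain(hypothetical_studies)
unweighted = sum(gain for _, gain in hypothetical_studies) / len(hypothetical_studies)
print(weighted, round(unweighted, 2))
```

Because large studies count for more, the weighted figure can differ noticeably from a simple average of study results, which is one reason the selection of studies matters to the conclusion.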

Slack and Porter continue their appraisal by examining data concerning the
predictive power of the SAT. The focus of most of these studies is on correlations
between SAT test scores and first-year college grades. The power of the SAT is
expressed as a validity coefficient and indicates the degree to which SAT scores
predict college grades.

Beginning with a predictive validity study conducted with the first SATs in 1926, the
authors report that the median validity coefficient for student SAT scores was 0.34,
for the high school record for these same students it was 0.52, and for the SAT
scores and high school records combined it was 0.55. In 1926, SAT scores increased
by 0.03 the predictive power provided by high school records alone. 7 The authors
appear skeptical that an increase of 0.03 represents much improvement over the
predictive power of high school records. 8
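
The arithmetic behind the 0.03 figure can be checked directly from the three coefficients quoted above. The second calculation, which squares each coefficient to express it as a proportion of variance in college grades accounted for, is a conventional interpretation added here for illustration and is not a computation reported by Slack and Porter.

```python
# Median validity coefficients reported for the 1926 study.
r_sat_alone = 0.34
r_high_school_record = 0.52
r_combined = 0.55

# Gain in the coefficient from adding SAT scores to the high school record.
increment_in_r = round(r_combined - r_high_school_record, 2)

# Assumption of this sketch: read each squared coefficient as the
# proportion of variance in first-year grades accounted for.
variance_record = round(r_high_school_record ** 2, 4)
variance_combined = round(r_combined ** 2, 4)
increment_in_variance = round(variance_combined - variance_record, 4)

print(increment_in_r, variance_record, variance_combined, increment_in_variance)
```

On this reading, adding SAT scores raises the share of variance accounted for from about 27 percent to about 30 percent.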

Slack and Porter examined studies cited in an article by W.B. Schrader entitled
"The Predictive Validity of College Board Admissions Tests" in a publication by
CEB and validity coefficient tables published by CEB for 1964 through 1974. Using
two different methods of interpretation, also employed by ETS, the authors
conclude that relatively recent studies show that the SAT adds little to the predictive
power of the high school record when considered alone. 9 Thus, as early as 1926 and
certainly during the ten-year interval of 1964 through 1974, SAT data provided no
evidence that the SAT was a successful indicator of college success.

Based on their critical review of SAT data, Slack and Porter conclude that the SAT
is a standardized test of achievement, not a test of aptitude. 10 As a test of
achievement, SAT content is far removed from most high school and college curricula. 11
Furthermore, the authors believe performance can be improved with coaching. 12

The findings reported by Slack and Porter are supported by NEA analyses of SAT
data obtained through the Federal Trade Commission. As described in Parts A and B
of this section, NEA examined SAT data for evidence of the influence of coaching
on average SAT scores and of the degree to which coaching improves SAT scores.
FOOTNOTES

1 The College Examination Board (CEB) was formerly the College Entrance
Examination Board (CEEB).

2 Slack, W.V. and Porter, Douglas. "The Scholastic Aptitude Test: A Critical
Appraisal." Harvard Educational Review 50: 156. May 1980.

3 Ibid., p. 158.

4 Ibid., p. 164.

5 Ibid., p. 161.

6 Ibid., p. 161.

7 Ibid., p. 165.

8 Ibid., p. 165.

9 Ibid., p. 166-67.

10 Ibid., p. 169.

11 Ibid., p. 172.

12 Ibid., p. 155.
PART A: NEA ANALYSIS OF SAT DATA CONCERNING
DIFFERENCES BETWEEN COACHED AND NONCOACHED STUDENTS

World-wide, special schools exist to prepare or coach individuals to take the SAT.
The existence of these schools and their effectiveness have been questioned and in
1976 were investigated by the Federal Trade Commission. The purpose of the FTC
study was to provide support for the contention that coaching does improve an
individual test score on the SAT and the Law School Admission Test (LSAT).
Based on an analysis of SAT and LSAT data, the FTC reported in 1978 that
coaching was dramatically effective for the SAT and that the LSAT was susceptible
to coaching. A caveat issued by the Commission, however, explained that some of
the conclusions in the study were not supported by evidence obtained from the
investigation.

Much of the concern giving rise to this warning centered around the fact that the
characteristics of the coached and noncoached groups included in the study were
not reasonably similar. For example, the coached group ranked higher in high
school, had parents with higher average incomes, included fewer black students,
included students with better grades in English and math, and included a greater
proportion of students from nonpublic schools (44.7 percent of the coached
students attended nonpublic schools whereas only 24.6 percent of noncoached
students did so). Thus, score differences could not be attributed to coaching alone.

In 1979 a revised FTC statistical analysis of SAT data was released. According to
this analysis, coaching was found to be effective for students who did not score well
on standardized tests, but the initial concern with group differences had not been
addressed. Because of its interest in testing and coaching, NEA Research requested
and eventually received a copy of the original data tape used to generate the 1978
and 1979 FTC analyses. NEA Research then proceeded to duplicate the statistical
analyses reported in the 1979 FTC report. The question addressed in the NEA
independent analysis of the FTC data tape was whether differences in SAT scores
could be found after taking into consideration some of the possible differences
among the coached and noncoached groups of persons (differences such as income
level and high school grades).

NEA designed two approaches for grouping persons included in the FTC data base.
The first approach involved an effort to match persons in coached and noncoached
groups with respect to characteristics which might have an impact on SAT scores
and which might influence SAT score differences. Sex, age, family income, and high
school standing were among the characteristics considered. Within the available
FTC data base, this approach failed to yield a sufficient number of "matched pairs" for
meaningful analysis. A second procedure was consequently used.

The second procedure involved two statistical techniques for grouping persons
included in the FTC data base and for assessing differences between SAT scores of
coached and noncoached groups. The first technique was a discriminatory analysis
to determine whether certain characteristics available for some students could be
successfully used to classify individuals into coached or noncoached categories.
Given that certain characteristics could be identified, the second technique used was
an analysis of covariance to determine whether, on the average, a significant
difference (better than 90 percent confidence) existed between coached and
noncoached SAT test scores after eliminating the linear effects of the characteristics
included from the discriminatory analysis.
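
The idea of "eliminating the linear effects" of a characteristic before comparing group averages can be sketched with a single covariate. This is a simplified illustration and not the NEA procedure: the actual analyses used several covariates and a significance test, while the sketch below only computes covariate-adjusted group means using the pooled within-group slope. All scores and the function name `adjusted_means` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def adjusted_means(groups):
    """One-covariate analysis of covariance adjustment.

    `groups` maps a group name to a list of (covariate, outcome) pairs.
    Returns the pooled within-group slope and each group's outcome mean
    adjusted to the grand mean of the covariate.
    """
    grand_x = mean([x for pairs in groups.values() for x, _ in pairs])
    sum_xy = 0.0
    sum_xx = 0.0
    group_means = {}
    for name, pairs in groups.items():
        mean_x = mean([x for x, _ in pairs])
        mean_y = mean([y for _, y in pairs])
        sum_xx += sum((x - mean_x) ** 2 for x, _ in pairs)
        sum_xy += sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        group_means[name] = (mean_x, mean_y)
    slope = sum_xy / sum_xx
    adjusted = {
        name: mean_y - slope * (mean_x - grand_x)
        for name, (mean_x, mean_y) in group_means.items()
    }
    return slope, adjusted


# Hypothetical (PSAT verbal, SAT verbal) pairs, invented for illustration.
data = {
    "coached": [(45, 480), (50, 530), (55, 580), (60, 630)],
    "noncoached": [(40, 400), (45, 450), (50, 500), (55, 550)],
}

slope, adjusted = adjusted_means(data)
raw_difference = mean([y for _, y in data["coached"]]) - mean(
    [y for _, y in data["noncoached"]]
)
adjusted_difference = adjusted["coached"] - adjusted["noncoached"]
print(slope, raw_difference, adjusted_difference)
```

In this invented example the coached group starts with higher PSAT scores, so part of the raw 80-point difference disappears after adjustment, leaving 30 points. A full analysis of covariance would go on to test whether the remaining difference is statistically significant.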

Given the second approach, four subgroups of high school students, also used in the
1979 FTC report, were identified. With this analysis, coached and noncoached
students could be grouped by grade level, year, and whether they had taken the SAT
once or twice. Table 1 presents a descriptive summary of the four subgroups
identified for analytical purposes.



Table 1: First- and Second-Time SAT Examinees Grouped
by Grade Level, Year, and Coached Status

High School Group               Coached    Noncoached

SAT Examinees: First Time
  Juniors, 1975                    50          451
  Juniors, 1976                                487

SAT Examinees: Second Time
  Seniors, 1975                    51          305
  Seniors, 1976                   111          324



In each of the above four groups there was a maximum of 24 characteristics in the
data base. For example, parent's income level, high school rank, sex, and latest
math grade were reported for all four groups. On the other hand, Preliminary
Scholastic Aptitude Test (PSAT) scores were reported for only the first two groups.

The first of the statistical techniques employed considered the available
characteristics associated with coached and noncoached students within each of the four
groups. The basic question which the technique asked was: Which of the
characteristics, if any, can be used to differentiate a student as being in the coached or
noncoached category? Putting it another way: Are there certain characteristics
which serve to classify a student as coached or noncoached?

If such characteristics could be identified, then the next question was: Taking into
consideration characteristics which serve to differentiate between the coached and
noncoached students, is there a difference in the average SAT scores?

Both of these questions are addressed in the material which follows, for each of the
four groups separately. Then a summary is presented, based upon findings for each
group.



Juniors, 1975, First Time Takers

Fourteen variables were included in the analysis with all 14 allowing for 74 percent
of the students being correctly classified as either coached or noncoached. When
criteria for inclusion in the analysis were stipulated, 8 of the 14 were considered and
74 percent correctly identified. The significant variables for discrimination, in the
order selected, were:

1. Parent's income level
2. Latest math grade
3. PSAT verbal score
4. SAT verbal score
5. High school class rank
6. High school (public or private)
7. PSAT math score
8. SAT math score

There were two analyses of covariance performed on the verbal SAT and the math 
SAT. 

Verbal Score - First Exam:

Two analyses were performed, the first including the PSAT verbal score and the
second excluding the PSAT verbal score (PSAT math score and SAT math score
first exam were also excluded for this analysis). There is a significant difference
between average verbal scores of coached and noncoached at the 0.02 level. With
the PSAT verbal score removed, there is still a significant difference between the
mean values.

Math Score - First Exam:

The same two analyses were performed for the SAT math scores, with and without
the PSAT math (SAT verbal and PSAT verbal were not included). The results are
identical with the mean differences being significant at better than the 0.01 level in
both cases.



Juniors, 1976, First Time Takers 

Sixteen variables were included in the analysis with all 16 allowing for 72 percent
correctly being classified as coached or noncoached. When criteria for inclusion in
the analysis were stipulated, 13 of the 16 were considered and 77 percent correctly
classified. Of these 13 only the first 8 were selected for the covariance analyses which
follow. The first 8 variables, in the order selected, were:

1. Parent's income level
2. SAT verbal score
3. PSAT verbal score
4. Two PSATs taken before the first SAT
5. Latest math grade
6. PSAT math score
7. SAT math score
8. Sex

There were two analyses of covariance performed on the verbal SAT and math
SAT.

Verbal Score - First Exam:

Two analyses were performed, the first including the PSAT verbal and the second
excluding it. There is a significant difference at the 0.01 level between the coached
and noncoached average verbal scores both with and without the PSAT verbal
being included.

Math Score - First Exam:

The same two analyses were performed for the SAT math scores, both with and
without the PSAT math as a factor. With the PSAT math included, there is a
significant difference between coached and noncoached at the 0.01 level. With the
PSAT math eliminated, there is no significant difference between the coached and
noncoached average math scores at the 0.10 level, although average SAT scores for
coached are higher than noncoached.



Some 

difference found 



Seniors, 1975, Second Time Takers 

Fifteen variables were included in the analysis with all 15 allowing for 72 percent of
the students being correctly classified as coached or noncoached. Seven variables
accounted for the most variability and were selected in the following order:

1. SAT verbal score, first exam
2. SAT verbal score, second exam
3. Latest math grade
4. Years of English in high school
5. SAT math score, first exam
6. Parent's income level
7. SAT math score, second exam

There were two analyses of covariance performed on the verbal SAT and math SAT
where the second exam scores were to be compared.

Verbal Score - Second Exam:

Two analyses were performed, the first including the SAT verbal score (first exam)
and the second excluding this variable. In both instances there is a highly significant
difference (better than 0.01) between coached and noncoached average scores after
eliminating the linear effect of the appropriate variables above.

Math Score - Second Exam:

The same two analyses were performed for the SAT math score (second exam),
both with and without the SAT math score for the first exam. With the first SAT
exam score included, there is a significant difference at better than 0.01. With the
first SAT math score not included, the difference is significant at the 0.10 level.



Significant 
difference found 



Significant 
difference found 
Seniors, 1976, Second Time Takers

When all 15 variables are included in the analysis, 73 percent of the students are
correctly assigned to the coached and noncoached groups. With the first 6 variables,
the same percentage are properly classified. The 6 significant variables are:

1. SAT verbal score, first exam
2. SAT verbal score, second exam
3. Latest English grade
4. SAT math score, first exam
5. SAT math score, second exam
6. Parent's income level

There were two analyses of covariance performed on the verbal SAT and math SAT
where the second exam scores were compared.

Verbal Score - Second Exam:

Two analyses were performed, with and without the SAT verbal score (first exam)
as a factor. With the SAT verbal score on the first attempt as a factor, there is a
significant difference between the two groups (better than 0.01). With the SAT
verbal score (first exam) excluded, there is no significant difference at the 0.10 level,
although average SAT scores for the coached are higher than for the noncoached.

Math Score - Second Exam:

The two analyses in this case yield the same conclusions, namely, with or without
the math score (first exam) as a factor there is a significant difference between the
two groups at better than the 0.01 level.

Conclusions

Based upon the analyses just described, the following conclusions can be drawn:

1. In examining the means and standard deviations for the variables in the
   analyses, it was evident that the averages for the coached group tended to
   be higher while the standard deviations tended to be lower than the
   noncoached group, suggesting not only higher levels of the variables for
   the coached group but also more homogeneity among the students within
   the coached group.

2. In all instances, significant discriminations were noted between the
   coached and noncoached groups. From 73 percent to 77 percent of the
   individuals in the four groups were correctly assigned to either the
   coached or noncoached group.

3. Of the 16 analyses performed on the differences between the SAT scores
   after eliminating the effects of characteristics identified (see Table 2),
   significant differences exist in 14 of the 16 cases. This finding strongly
   suggests that differences still exist in the average SAT scores between the
   coached and noncoached. In all instances, the differential SAT averages
   were still higher for the coached than the noncoached group. Parent's
   income level was the one characteristic (external to the SAT and PSAT
   scores) which appeared as a significantly discriminating variable in all 16
   analyses.






Table 2. Summary Results: Levels of Significance Between Coached
and Noncoached Students, Classified by Grade Level
and Year for Analysis of Covariance 1

                            VERBAL SAT                  MATH SAT
STUDENT GROUP          PSAT Included   No PSAT     PSAT Included   No PSAT

1st Time Takers:
  Juniors 1975             0.02         0.10           0.01         0.01
  Juniors 1976             0.01         0.01           0.01         N.S. 2

                       1st SAT         No 1st      1st SAT         No 1st
                       Included        SAT         Included        SAT

Coached between
1st and 2nd SAT:
  Seniors 1975             0.01         0.01           0.01         0.10
  Seniors 1976             0.01         N.S. 2         0.01         0.01

1 For differences between average SAT scores to be considered as significant, the
requirement was 0.10 or less. Results are reported as 0.10, 0.02, and 0.01.
2 Not significant, but average coached SAT scores were higher than noncoached.
PART B: SHOULD HIGH SCHOOLS COACH
STUDENTS TO IMPROVE SAT SCORES?

How much
difference does
coaching make?

The Federal Trade Commission's reports (1978, 1979) on the coachability of the
SAT did not adequately answer the question, "Should high schools coach students
to improve SAT scores?" The statistical evidence and related conclusions
concerning the coached group described in the previous part of this section suggest that the
answer should be affirmative. There is a more important question, however, that
remains unanswered. It is: "To what extent does coaching improve SAT scores?"
NEA Research attempted to answer this question by analyzing the FTC data for
those students who had taken the Preliminary Scholastic Aptitude Test (PSAT)
once and the Scholastic Aptitude Test (SAT) twice.






Method and Procedures

Sample


The FTC data base was compiled for the interval of October 1974 through
December 1976. The base included 2,286 SAT coached and 1,777 FTC matched but
uncoached individuals. The selection of students for the NEA analysis was based on
two criteria. First, only students who had taken the PSAT once and the SAT twice
would be included in the sample. Second, coached students included in the sample
would be drawn from the coaching school identified by previous FTC analyses to
have produced the best results.

Given these criteria, a sample of 1,324 students was drawn. Of these students, 625
took the examinations in 1975, and 699 students took the examinations in 1976.
Students were then placed into one of three groups. Group placement was based on
whether students had been coached and, if coached, when coaching had occurred.
Because students for the years 1975 and 1976 were considered separately, a total of
six subgroups, three subgroups for each year, were ultimately identified. The
distribution of students for the six subgroups was as follows:

1975: 65 students coached between PSAT and first SAT
      105 students coached between first and second SAT
      405 noncoached students

1976: 118 students coached between PSAT and first SAT
      173 students coached between first and second SAT
      408 noncoached students.






Three research
questions

Statistical Treatment 

The purpose of the analysis was to determine individual growth of each group. The
individual was used as her or his own control, and no comparisons among groups
were made. The measure of growth was the average gain between the PSAT and the
second SAT. For each group, three questions were addressed:

1. What was the average gain between the PSAT and the second SAT?
2. Was the average gain statistically significant?
3. Was the average gain practically significant?

The design used for this study and its limitations are discussed extensively by
Campbell and Stanley. 1 The design is commonly designated as the one-group
pretest-posttest design or as a before-after study with a single group. The statistical
procedure used to measure the average gain was a "t" test for correlated data.
Means, standard deviations, and tests of significance were performed for each of the
six subgroups on the SAT verbal, SAT math, and total SAT scores.

Tests of significance were performed on the verbal, mathematical, and total gain
score for all six subgroups. Three groups for 1975 and three groups for 1976 were
analyzed.
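
A "t" test for correlated data, in which each student serves as her or his own control, can be sketched as follows. The before and after scores are hypothetical and stand in for a student's PSAT-scale score and second SAT score; the function name `paired_t` is an assumption of the sketch, and it stops at the t statistic and degrees of freedom rather than looking up a significance level.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t statistic for correlated (paired) scores.

    Returns the mean gain, the t statistic, and the degrees of freedom.
    """
    gains = [a - b for b, a in zip(before, after)]
    n = len(gains)
    mean_gain = mean(gains)
    standard_error = stdev(gains) / sqrt(n)  # sample SD of the gains
    return mean_gain, mean_gain / standard_error, n - 1


# Hypothetical scores for five students (invented for illustration).
first_scores = [400, 450, 500, 520, 480]
second_scores = [430, 470, 540, 530, 530]

mean_gain, t_statistic, degrees_of_freedom = paired_t(first_scores, second_scores)
print(mean_gain, round(t_statistic, 2), degrees_of_freedom)
```

The t statistic would then be compared with a tabled critical value for the given degrees of freedom. As Campbell and Stanley's discussion of this design implies, a significant gain in a single group does not by itself separate the effect of coaching from practice and growth.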

Results 



Differences between PSAT and second SAT scores were found for each of the three
subgroups for 1975 and 1976. Differences were found for all groups on the SAT
verbal, SAT math, and total SAT scores. Differences on all of the measures were
significant at the 0.01 level. Tables 2 through 7, which appear at the conclusion of
this section, present the average gain scores for SAT total, SAT verbal, and SAT
math scores for each of the three groups for 1975 and for 1976. For discussion
purposes, select data from Tables 2 through 7 have been assembled for Table 1
presented below.



Significant 
difference found 



Table 1. Select Average SAT Point Gain Scores for
Coached and Noncoached PSAT and Two-Time SAT Takers for
1975 and 1976

                                             Total       Verbal      Math        Average
                                             Average     Average     Average     Family
Year  Group                                  Point Gain  Point Gain  Point Gain  Income

1976  Coached between PSAT and 1st SAT          143         73          70       $29,000
1976  Coached between 1st SAT and 2nd SAT       135         65          70        26,000
1976  Noncoached                                 60         29          31        21,000
1975  Coached between PSAT and 1st SAT          114         55          59        30,000
1975  Coached between 1st SAT and 2nd SAT       104         47          57        26,000
1975  Noncoached                                 44         17          27        20,000



According to the data in Table 1, the 1976 group coached between the PSAT and
the first SAT achieved the greatest total average gain (143 points). The 1976 group
coached between the first and second SAT had an average gain of 135 points, while
the 1976 noncoached group had an average gain of 60 points.

The 1975 average gains for all three groups were somewhat smaller. The group
coached between PSAT and the first SAT showed an average gain of 114 points.
The group coached between the first and second administration of the SAT showed
an average gain of 104 points. The noncoached group average gain was 44 points.

The average gain scores for all three groups for both 1975 and 1976 suggest that
taking the test three times produces positive results. This was true for the verbal and
math scores as well as for the total scores. It is interesting to note that a relationship
appears to exist between coaching and average family income.
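
The figures reported in Table 1 can be checked with a short calculation: each total gain should equal the verbal gain plus the math gain, and the coached gains can be set beside the noncoached gain for the same year. The second step is purely descriptive; the NEA analysis itself made no statistical comparisons among groups.

```python
# (year, group, total gain, verbal gain, math gain) as reported in Table 1.
TABLE_1 = [
    (1976, "Coached between PSAT and 1st SAT", 143, 73, 70),
    (1976, "Coached between 1st SAT and 2nd SAT", 135, 65, 70),
    (1976, "Noncoached", 60, 29, 31),
    (1975, "Coached between PSAT and 1st SAT", 114, 55, 59),
    (1975, "Coached between 1st SAT and 2nd SAT", 104, 47, 57),
    (1975, "Noncoached", 44, 17, 27),
]


def totals_are_consistent(rows):
    """True when every total gain equals verbal gain plus math gain."""
    return all(total == verbal + math_gain for _, _, total, verbal, math_gain in rows)


def excess_over_noncoached(rows):
    """Coached total gain minus the same year's noncoached total gain."""
    baseline = {year: total for year, group, total, _, _ in rows if group == "Noncoached"}
    return {
        (year, group): total - baseline[year]
        for year, group, total, _, _ in rows
        if group != "Noncoached"
    }


totals_ok = totals_are_consistent(TABLE_1)
excess = excess_over_noncoached(TABLE_1)
for (year, group), points in sorted(excess.items()):
    print(year, group, points)
```

On these figures the coached groups gained between 60 and 83 points more than the noncoached group of the same year, although the groups also differed in average family income.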

Discussion

The results of the analysis of the two coached groups further suggest an affirmative
answer to the question, "Does coaching improve SAT scores?" This result is not
really too profound if one believes that instruction works. Common sense would
suggest that if one is taught four hours a week for ten weeks, there should be on the
average some positive results. The extent of the increases appears to be both
statistically and practically significant.

ETS, on the other hand, continues to hold to a position that discourages the use of
coaching for the SAT. In a message released by ETS in the early months of 1980
entitled Accountability, Fairness, and Quality in Testing, the following information
was reported about coaching:

    ETS has stated that our research shows a relatively small average
    gain in scores as an expected effect from short-term instruction. But
    this will have little, if any, effect on admissions decisions at most
    institutions. For instance, ETS research (which has been openly
    reported) shows the effect of coaching on the Scholastic Aptitude
    Test (SAT) has typically been a gain of less than 15 points for the
    verbal section and of less than 20 points for the mathematics section
    on the SAT scale, which ranges from 200 to 800 points. This
    corresponds to just two additional items correct per section. When the
    coaching was restricted to an explanation of the test's item formats
    and practice with them, research studies report smaller effects.
    Coaching that incorporates formal instruction in the subject matter
    with special test preparation has been shown to yield somewhat
    larger effects, perhaps three additional items correct rather than two.
    Finally, there is no research evidence to prove the claim that
    coaching can particularly benefit students from minority groups or
    students with low initial scores. Students who score low the first time
    they take a test often show greater gains upon retesting simply as a
    statistical artifact, without regard to coaching. 2



In its 1979 report the FTC found that "coaching was effective at the two schools
contributing on the average approximately 25 points to students' scores on both the
verbal and math SAT exams." 3 The report goes on to say that "the students who
attended the effective school (School A) tended to be underachievers on standardized
exams, i.e., they scored lower on standardized exams than would have been
predicted given their personal and demographic characteristics - grades in school
and class ranks." 4

In brief, the 1979 report of the FTC concerning coachability concluded that
"underachievers on standardized exams" could on the average increase their score
25 points on the verbal and 25 points on the math SAT exam with coaching from
one of the two schools studied.

NEA believes that if 25 or 50 points (on a 200-800 scale) can make the difference in
an admission decision to an undergraduate or graduate school of the student's
choice, then parents, students, and educators must decide whether the outcome is
worth a $200, $300, or even $500 coaching expenditure. This type of expenditure
can be afforded by some parents, but not all families have the wherewithal to invest
in a coaching school.

Furthermore, NEA believes all students should have an opportunity to receive
coaching free of charge. Unfortunately, the groups (i.e., lower socioeconomic,
minorities, and women) who have historically lost the most in achieving
opportunities for continued higher education are the same groups which score lower on the
SAT. The consequences of the lower test scores in many instances have led to the
perpetuation of discrimination against these groups.

The conflicting signals coming from ETS and CEB, along with the FTC's unwillingness
first to publish reports on the coachability question in a timely manner and
then later to turn over the necessary data to permit an independent analysis, have
left the estimated 1.5 million 1979-80 student SAT-takers with no real answer on the
coachability question. Consequently, the status quo has been maintained except for
the student SAT-takers in New York state. The state's truth-in-testing law requires
disclosure of results of the individual student's test, answer sheet, and related right
or wrong response.



NEA supports 
coaching 



Coaching and College Admission 



The New York truth-in-testing legislation helped escalate the issue of the SAT's
coachability. In response to the law and the coachability question, the College
Board developed a series of press releases and information memoranda about the
coachability of the SAT. These memos, released by the College Board in the fall of
1979, further confounded the issue. For example, according to Stephen H. Ivens,
director of program services at the College Board, the "SAT measures reasoning
abilities which are developed over time both in and out of school." 5 Ivens goes on to
say, "If verbal and mathematical reasoning can be learned, we assume that they can
be taught, directly or indirectly." 6 In contrast to this observation, the coaching
schools make subtle and blatant claims about their ability to develop these mental
operations. In addition, they speak to test-taking skills that include efficient use of
time limits, methods used to answer questions, and techniques for successful
guessing.

Unfortunately, these skills and mental abilities are held to be important by college
admissions personnel. Although most admissions officers claim that the SAT and
LSAT are not used, by themselves, to admit or reject an applicant, very few deny
that the scores play an important part in the admissions process.

In an effort to learn how important SAT scores were for college admission, NEA
queried a representative sample of eight universities about SAT score admission
requirements (see Table 9 at the end of this section). One university indicated that
there was no minimum for the verbal and math score; however, if the total was not
1,000 (out of a maximum of 1,600), there was no need to apply. Two other
institutions had ranges of acceptable scores (i.e., 500-800, 450-600) for both the
verbal and math as well as the total score. The remaining five schools all had
minimum scores for admission on all parts and the total.

When this information about admission requirements and the results of the FTC
data analyses on the coached and noncoached students who took the PSAT and
SAT twice are assessed, recommendations about coaching become obvious. The
average total gain scores for the 1976 groups were Group A, coached between the
PSAT and first SAT, 143; Group B, coached between the first and second SAT,
135; and Group C, noncoached, 60. A summary of the average gain on the three
groups' verbal, math, and total SAT scores appears in Table 8 at the end of this
section.
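
As a minimal sketch, the 1976 gain figures quoted above can be reproduced from the average PSAT and second-SAT totals reported in the tables at the end of this section (PSAT totals already converted to the SAT scale). The variable and function names are illustrative only, and the "gain beyond the noncoached group" is a simple subtraction added here for comparison; it is not a figure reported in the original analysis.

```python
# Average total scores for the 1976 groups, as reported in the tables
# at the end of this section. Names are illustrative, not the authors'.
TOTALS_1976 = {
    "A": {"psat": 920, "sat_2nd": 1063},  # coached between PSAT and 1st SAT
    "B": {"psat": 930, "sat_2nd": 1065},  # coached between 1st and 2nd SAT
    "C": {"psat": 900, "sat_2nd": 960},   # noncoached
}


def average_gain(scores):
    """Gain between the converted PSAT total and the second SAT total."""
    return scores["sat_2nd"] - scores["psat"]


gains = {group: average_gain(s) for group, s in TOTALS_1976.items()}

# Assumption for illustration: treat the noncoached gain as a baseline
# and subtract it from each coached group's gain.
gain_beyond_noncoached = {
    group: gains[group] - gains["C"] for group in ("A", "B")
}

for group, gain in gains.items():
    print(f"Group {group}: average total gain = {gain}")
print("Gain beyond noncoached group:", gain_beyond_noncoached)
```

The subtraction does not control for differences between the groups' starting scores; it only restates the reported averages.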



CEB response

When the results of the average gain effects are compared to the SAT admission
requirements of the eight universities shown in Table 9, the following information is
revealed:

• Groups A, B, and C on the average would not be eligible for admission to
any of the eight universities, based on the average converted PSAT
scores.

• Groups A and B on the average would be eligible for admission to four of
the eight universities, based on the second SAT average scores for the
verbal, math, and total.

• Group C on the average would be eligible for admission to one of the
eight universities, based on second SAT average scores for the total and
subtotals.

The differences between becoming eligible for admission to one versus four of the
eight representative universities leave little doubt that on the average coaching
does have a positive effect not only on improving the scores but on increasing the
possibility of being favorably considered for admission.
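
The comparison above can be sketched as a simple threshold check. This is a hedged illustration, not the procedure NEA used: it assumes that where Table 9 gives a range, the lower bound acts as the minimum, and it assumes Emory's total (1,150) and Rutgers' verbal score (490), which are partly illegible in the table, equal the values implied by the other two figures in each row. Group C's second verbal score (459) is taken from the 1976 verbal table.

```python
# Minimum (verbal, math, total) requirements based on Table 9.
# Assumptions: lower bound of a range is the minimum; None means no
# stated minimum; Emory total and Rutgers verbal are inferred values.
REQUIREMENTS = {
    "Harvard": (500, 500, 1000),
    "Yale": (670, 680, 1350),
    "Pennsylvania": (650, 660, 1310),
    "Columbia": (650, 660, 1310),
    "Pennsylvania State": (450, 450, 900),
    "Emory": (550, 600, 1150),
    "Rutgers": (490, 540, 1030),
    "George Washington": (None, None, 1000),
}

# 1976 group averages as (verbal, math, total).
PSAT = {"A": (440, 480, 920), "B": (440, 490, 930), "C": (430, 470, 900)}
SECOND_SAT = {
    "A": (513, 550, 1063),
    "B": (505, 560, 1065),
    "C": (459, 501, 960),
}


def meets(scores, minimums):
    """True if every score reaches its minimum (None = no minimum)."""
    return all(m is None or s >= m for s, m in zip(scores, minimums))


def eligible_schools(scores):
    """Names of the universities whose minimums the scores satisfy."""
    return [name for name, mins in REQUIREMENTS.items() if meets(scores, mins)]


for group in "ABC":
    before = eligible_schools(PSAT[group])
    after = eligible_schools(SECOND_SAT[group])
    print(f"Group {group}: {len(before)} on PSAT, {len(after)} on 2nd SAT {after}")
```

Under these assumptions the counts agree with the text: none of the groups qualifies anywhere on converted PSAT averages, Groups A and B qualify at four universities on second SAT averages, and Group C qualifies at one.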

Conclusions

A statistical analysis was performed on six subgroups, three for 1975 and three for
1976, to determine individual growth on the SAT given that each subgroup had
taken the PSAT once and the SAT twice. Three questions were posed:

1. What was the average gain between the PSAT and the second SAT?

2. Was the average gain statistically significant?

3. Was the average gain practically significant?

The answer to each of these questions based on the statistical treatment of the data
follows.

What Was the Average Gain Between the PSAT and the Second SAT? 

Coaching improves
SAT scores

Central to this question is the issue of whether a student can improve on the SAT if
he or she gets coached. All of the coached students spent at least four hours a week
for ten weeks in a SAT coaching school. The conclusion reached was affirmative.
The students who were coached made average group gains which were greater than
the average gain made by the uncoached group.

Was the Average Gain Statistically Significant? 

Score gains are
statistically significant

The analyses of each of the groups on all of the measures produced a statistical
difference at the 0.01 level for each of the three subgroups. This essentially means
that it can be stated with 99 percent confidence (for each of the groups studied,
coached and noncoached) that there was a difference between the PSAT and the
second SAT scores for all three groups.



Was the Average Gain Practically Significant?

Practical significance
prompts other questions

A comparison between a pre and postmeasure may be statistically significant;
however, the practical implication or consequences of changing the conditions (e.g.,
a new reading program requiring texts, in-service training, etc.) may not be
possible. In some cases, doing what is currently being done in an improved manner may
provide just enough change to produce a nonsignificant difference if the study were
replicated. Practical significance becomes a question of a subjective value. For
example, if you have $300 or $500 to spend on a coaching school, the answer would
be yes; however, if you don't, the answer would be no. A related and equally
important question concerns how things are working in the real world. If equal
educational opportunity is in place and everyone is given the same opportunity for
an education, then the answer may still be yes. On the other hand, if the way things
are working systematically discriminates against certain groups, then the answer
may be no. Furthermore, a basic question about the test's (SAT) predictive validity
in college and life has not been answered.



Question of SAT's Predictive Validity 

There was an extensive analysis and report on the SAT recently conducted by the
College Board which treated the question of the SAT's prediction capacity. In the
discussion about how well the SATs predict college success, it was reported that
"high school grades are still the best single predictors of college performance; but
when these grades are combined with SAT scores more accurate prediction proved
possible." 7 The critical point always excluded when the College Board or ETS
report on the SAT's predictive validity is a consideration of other variables or
methods that can be used other than the SAT to improve the prediction of college
success. A related issue is how much of what is being measured by the SAT is
already contained in the grades given by teachers to the students.

An analysis of this concept, which was reported in NEA's Analysis of the Wirtz
Report on Declining SAT Scores (see Appendix A for an executive summary), is
repeated here to demonstrate what is meant. (In the following excerpt, HSR refers
to high school record.)

A statistic used to interpret a correlation coefficient (e.g., validity
coefficient) is the coefficient of determination. The coefficient of
determination is symbolically represented by r² (or correlation
coefficient squared); and when the coefficient is multiplied by 100, a
percentage of the variance in the two measures is determined. In the
case of the HSR and first-year college grades, where the validity
coefficient was 0.5, the coefficient of determination (r²) would be (0.5
× 0.5 = 0.25) 0.25. When this value (0.25) is multiplied by 100, the
result is 25, or 25 percent of the explained variance. This percentage
of variance (25) is interpreted to be the percentage of variance
associated with the first-year college grades that can be determined
by, or accounted for, in the variance of the HSR.

In the 0.5 illustration, the validity coefficient of 0.5 provides the
percentage of variance in college grades that is accounted for by the
variance in high school grades, which was 25 or one-fourth. By using
only the high school grades, 25 percent of the variance can be
accounted for and 75 percent needs to be explained. Generally, this
75 percent is attributed to individual motivation and the institutional
personnel who work to develop the student. This unexplained
variance is the factor that makes the difference in a successful college
and life experience. When the SAT Verbal and Math scores were
combined with the HSR (1974), the multiple r was computed to be
0.58.

The same statistical principles apply to a single or multiple
correlation.

Basically, a coefficient of determination may be used to determine
the amount of variance explained by two or more measures
correlated with the criterion. In the illustration the criterion measured is
college grades. The combined HSR, SAT Verbal, and SAT Math
were correlated with the first-year college grades and the computed
multiple correlation (R) was 0.58.

SAT's predictive
power questioned
further

The coefficient of determination (R²) for a multiple R of 0.58
computes to be 0.34. When 0.34 is multiplied by 100, the percentage of the
variance explained in the three measures (HSR, SAT Verbal, and
SAT Math) is 34 percent. This leaves two-thirds of the variance in
first-year college grades unaccounted for by the three measures.

Another way of viewing the increased accounted variance in absence
of an established causal relationship between the measures used for
prediction and the criterion measure (first-year college grades) is to
evaluate how much more each measure adds to the prediction.
For example, the 1974 ETS's Validity Study Service (VSS) provided
the basic data about the validity coefficients of the HSR, SAT
Verbal, SAT Math, and the combined multiple correlation. These
and the coefficient of determination follow.

              Computed r between
              predictor (SAT scores) and       Coefficient
Measure       criterion (college grades)       of determination

HSR           0.50                             25%
SAT Verbal    0.42                             18%
SAT Math      0.39                             15%
Combined      0.58                             34%

As can be seen from this display, the combined multiple correlation
of 0.58 produces the greatest percentage of explained variance
followed by HSR. However, the HSR not only provides for the most
explained variance among the measures, it also accounts for much of
what is being measured in the SAT Verbal and Math tests. This is
revealed by the relatively small increase of variance in the combined
(multiple r) coefficient of determination. The increase of 9 percent
(25 percent to 34 percent) of variance accounted for in the combined
measure suggests that not only is the HSR the single best predictor of
first-year college grades, it also provides the same information that is
tested for on the SAT. If this were not the case and if each SAT
section were actually measuring something unique, there would be a
greater relationship (multiple r) among the combined measures.

It appears that the validity coefficients produced in the VSS study of
the 783 colleges requesting the ETS service do little to justify the use
of the SAT as a predictor of first-year college grades. Further, to
make a generalization about all colleges or to attempt to justify the
construction of the SAT based on the relatively low reported validity
coefficients raises some serious questions about whether it is the
public interest or ETS's interest that is being served.

The reported information indicates that the panel answered the
question about the SAT's reliability and attempted to answer the
question about predictive validity.

The more significant validity questions about construct validity (the
underlying theoretical basis of what is actually being measured by the
instrument, combined with supportive statistical and logical data
from research studies) and content validity (which related to the
content currently being taught in the schools) were not adequately
investigated or at least not reported.

To use the concept of an "unchanging standard" and to begin to
investigate the changes in schools and society for 25, 20, or even five
years does not suggest that the most objective approach was used to
evaluate the decline in the SAT scores. It appears that what was
examined is the SAT's viability to continue in its present form as a
source of ETS revenue. Furthermore, the panel served as a buffer to
make this determination. There is no mention in the report about
whether the SAT should be discontinued or even modified.
The CEB paid for and produced a report that did not question the
validity of the use of the SAT in contemporary society. Instead, it
was used as a criterion measure or an "unchanging standard." It
should be noted here that the NEA requested that the CEB panel
investigate the validity of the continued use of the SAT. 8
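
The arithmetic in the excerpt above can be checked with a short sketch. It simply squares each validity coefficient reported in the excerpt and rounds to a whole percentage; the function and variable names are illustrative and do not come from the NEA or ETS analyses.

```python
# Validity coefficients quoted in the excerpt (1974 Validity Study Service).
VALIDITY = {
    "HSR": 0.50,
    "SAT Verbal": 0.42,
    "SAT Math": 0.39,
    "Combined": 0.58,
}


def percent_variance_explained(r):
    """Coefficient of determination (r squared) as a rounded percentage."""
    return round(r * r * 100)


explained = {
    measure: percent_variance_explained(r) for measure, r in VALIDITY.items()
}

# Extra variance accounted for when SAT scores are added to the HSR.
added_by_sat = explained["Combined"] - explained["HSR"]

# Variance in first-year college grades left unaccounted for.
unexplained = 100 - explained["Combined"]

for measure, pct in explained.items():
    print(f"{measure}: {pct}%")
print(f"Added by SAT beyond HSR: {added_by_sat} percentage points")
print(f"Unexplained by all three measures: {unexplained}%")
```

The result matches the excerpt: 25, 18, 15, and 34 percent, a 9-point increase from adding the SAT, and roughly two-thirds of the variance left unexplained.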

SAT as the National Curriculum 

One of the major conclusions that emerged from the Wirtz report is that the SAT
was viewed as an "unchanging standard." It was portrayed as a universal truth
designed to be used as a criterion to judge students' ability to succeed in college.
Furthermore, the report went to great length to assure educators that, as an
"unchanging standard," the SAT allows comparisons to be made with graduating
seniors 5 or 20 years ago. As a constant measure, the SAT, according to ETS,
ensures that "any particular score received on a current test indicates the same level
of ability to do college work that the same score did 36 or 20 or 5 or 2 years ago." 9

The Wirtz report begged the question of how a test created in 1926 and normed and
scaled in 1941 can remain valid through an era of curriculum changes in mathematics
and science, not to mention the changes in world boundaries and ideology.
Furthermore, the belief that the SAT is an "unchanging standard" in a society that
had evolved through the third industrial revolution (i.e., cybernetics) and three wars
suggests that the motivation and desire for the "good old days" was stronger than
the desire for an objective analysis of the SAT.

There may be another motivation, subtle and unspoken, for defending the SAT as
an unchanging standard; namely, the desire to maintain the SAT as a surrogate for
a "national curriculum." It is a test that is taken by about 1.5 million students
annually. Although the College Board and ETS stress that the SAT should not be
used to judge school programs, teacher performance, or student progress, it is
frequently used by journalists, politicians, and even educators as a quality measure
of education.

The results on the coachability issue will certainly increase the number of school 
districts with coaching courses in the high school. This in turn will further the 
argument that the SAT is the first course of study that will be taught in the country 
and, therefore, it provides the precedent for other courses to be used in the 
development of a national curriculum. 

Alternatives to Coaching for the SAT

There are many viable options to the SAT (and coaching for it) available for use,
given the reduced number of high school graduates projected between now and
1995. It is estimated that there will be 400,000 fewer high school graduates (2.8
million to 2.4 million) between 1984 and 1987. Given this reduction in graduates
and the existing available higher education classrooms, it would seem that the time
has come to examine alternative selection methods and techniques. It is not
necessary to continue to use an outdated and "unchanging standard" in a dynamic
society with a diverse and creative youth population that deserves better than to be
subjected to a 2½-hour paper and pencil test to become eligible for consideration to
a college or university.

Academic Prediction Scales



Academic prediction scales have been available for use for over 20 years. These
scales are generally designed to improve the ability of high school grades to serve as
predictors of college grades. According to B. S. Bloom and F. R. Peters, studies
have shown that with grade adjustments the correlation between high school grades
and college grades reaches a level of +0.70, and in some particular schools
or colleges the correlation is as high as +0.85.* It is not difficult to conclude that
these correlations compare very favorably to aptitude and achievement test
correlations, which fall in the range of 0.28 to 0.42.
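
One way to read these figures, offered here only as an illustrative sketch, is to apply the coefficient of determination described earlier in this section to the correlations quoted in the paragraph above. The percentages are derived by squaring the quoted correlations; they are not figures reported by Bloom and Peters, and the names used are hypothetical.

```python
# Correlations with college grades quoted in the paragraph above.
CORRELATIONS = {
    "adjusted high school grades (typical)": 0.70,
    "adjusted high school grades (some schools)": 0.85,
    "aptitude/achievement tests (low end)": 0.28,
    "aptitude/achievement tests (high end)": 0.42,
}


def variance_explained_pct(r):
    """Percentage of criterion variance associated with the predictor."""
    return round(100 * r ** 2)


variance_pct = {
    label: variance_explained_pct(r) for label, r in CORRELATIONS.items()
}

for label, pct in variance_pct.items():
    print(f"{label}: about {pct}% of variance")
```

On this reading, a correlation of 0.70 corresponds to roughly half of the variance in college grades, against less than one-fifth at the top of the quoted test range.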

With such data available for many years, the question of its lack of use has been
answered with the explanation that there were too many students applying for
admission. It is apparent that with declining enrollments, this explanation is
inappropriate. This observation, along with the evidence that the SAT discriminates
against minorities, lower socioeconomic groups, and women, suggests that an
alternative approach should be used. The academic prediction scales described by
Bloom and Peters offer at least one viable approach. This would give time to the
College Board and the researchers at ETS to develop more objective and equitable
measures that would treat everyone in a just manner.

The nation's educational objective must be to fit the desires, ambitions, and
developed abilities of every student who wishes a college education to the most
[illegible] curriculum. This must be the approach if we are to give every child an
equal educational opportunity.

Table 2. Average SAT Total Point Gain for Coached and
Uncoached PSAT¹ and Two-Time SAT Takers for 1975.

                  Coached Between   Coached Between
                  PSAT and          1st SAT and
                  1st SAT           2nd SAT           Noncoached
                  (N=65)            (N=105)           (N=455)
                  (A)               (B)               (C)

Total Scores:
  PSAT            940               930               890
  SAT - 1st       1,022             965               913
  SAT - 2nd       1,054             1,034             934

Average gain
between PSAT
and 2nd SAT       114               104               44

Table 3. Average SAT Total Point Gain for Coached and
Uncoached PSAT¹ and Two-Time SAT Takers for 1976.

                  Coached Between   Coached Between
                  PSAT and          1st SAT and
                  1st SAT           2nd SAT           Noncoached
                  (N=118)           (N=173)           (N=408)
                  (A)               (B)               (C)

Total Scores:
  PSAT            920               930               900
  SAT - 1st       1,016             971               926
  SAT - 2nd       1,063             1,065             960

Average gain
between PSAT
and 2nd SAT       143               135               60

¹The PSAT is described as a shortened version of the College Board's SAT. It yields 2
scores, verbal and mathematical, on a scale of 20-80 and is directly comparable to the SAT
score scale of 200-800.

Table 4. Average SAT Verbal Score Gain for Coached and
Uncoached PSAT¹ and Two-Time SAT Takers for 1975.

                  Coached Between   Coached Between
                  PSAT and          1st SAT and
                  1st SAT           2nd SAT           Noncoached
                  (N=65)            (N=105)           (N=455)
                  (A)               (B)               (C)

Total Scores:
  PSAT            450               450               430
  SAT - 1st       486               467               436
  SAT - 2nd       505               497               447

Average gain
between PSAT
and 2nd SAT       55                47                17


Table 5. Average SAT Verbal Score Gain for Coached and
Uncoached PSAT¹ and Two-Time SAT Takers for 1976.

                  Coached Between   Coached Between
                  PSAT and          1st SAT and
                  1st SAT           2nd SAT           Noncoached
                  (N=118)           (N=173)           (N=408)
                  (A)               (B)               (C)

Total Scores:
  PSAT            440               440               430
  SAT - 1st       496               462               445
  SAT - 2nd       513               505               459

Average gain
between PSAT
and 2nd SAT       73                65                29

¹The PSAT is described as a shortened version of the College Board's SAT. It yields 2
scores, verbal and mathematical, on a scale of 20-80 and is directly comparable to the SAT
score scale of 200-800.

Table 6. Average SAT Math Score Gain for Coached and
Uncoached PSAT¹ and Two-Time SAT Takers for 1975.

                  Coached Between   Coached Between
                  PSAT and          1st SAT and
                  1st SAT           2nd SAT           Noncoached
                  (N=65)            (N=105)           (N=455)
                  (A)               (B)               (C)

Total Scores:
  PSAT            490               480               460
  SAT - 1st       536               498               477
  SAT - 2nd       549               537               487

Average gain
between PSAT
and 2nd SAT       59                57                27


Table 7. Average SAT Math Score Gain for Coached and
Uncoached PSAT¹ and Two-Time SAT Takers for 1976.

                  Coached Between   Coached Between
                  PSAT and          1st SAT and
                  1st SAT           2nd SAT           Noncoached
                  (N=118)           (N=173)           (N=408)
                  (A)               (B)               (C)

Total Scores:
  PSAT            480               490               470
  SAT - 1st       520               509               481
  SAT - 2nd       550               560               501

Average gain
between PSAT
and 2nd SAT       70                70                31

¹The PSAT is described as a shortened version of the College Board's SAT. It yields 2
scores, verbal and mathematical, on a scale of 20-80 and is directly comparable to the SAT
score scale of 200-800.

Table 8. Average SAT Gain Scores for Coached and
Noncoached PSAT and Two-Time SAT Takers on the Verbal,
Math, and Total Scores in 1976.

                  VERBAL            MATH              TOTAL
GROUPS            PSAT   2nd SAT    PSAT   2nd SAT    PSAT   2nd SAT

Coached (A)
between PSAT
and 1st SAT       440    513        480    550        920    1,063

Coached (B)
between 1st
and 2nd SAT       440    505        490    560        930    1,065

Noncoached (C)    430    459        470    501        900    960


Table 9. Select Universities' SAT Admission Requirements for
Academic Year 1980-81

University              Verbal      Math        Total

Harvard                 500-800     500-800     1,000-1,600
Yale                    670         680         1,350
Pennsylvania            650         660         1,310
Columbia                650         660         1,310
Pennsylvania State      450-600     450-600     900-1,200
Emory                   550         600         1,150
Rutgers                 490         540         1,030
George Washington                               1,000

Average 1979 SAT
Score for College-
Bound Seniors           427         467         894


SECTION III: COMMERCIAL INVOLVEMENT IN
STATEWIDE TESTING PROGRAMS

Statewide testing programs exist in nearly every state in the union, in the District of
Columbia, and in Puerto Rico. There is great variation among the programs with
respect to policies and procedures. There are also similarities among the programs
such that they can be described according to any one of a number of features.

1973 surveys

A number of features have been surveyed and described by Educational Testing
Service (ETS). In 1968 and again in 1973, ETS conducted surveys of individuals
knowledgeable of the state testing programs in their respective states. The purpose
of each survey was to gather information useful for constructing a profile of the
state testing program. In 1968, state profiles were prepared for the areas of
functions, tests, materials, and services. 1 In 1973, ETS areas of interest were program
purpose, management, test population, instrumentation, data collection and
processing, norms, information dissemination, and future prospects. 2

The ETS profiles provide information useful for describing the various programs
and for identifying their similarities and differences. The usefulness, however, is
limited in several ways. In some respects the 1973 data are incomplete. For example,
of the 33 states reporting statewide testing programs, only 28 identified the tests
they used.

From a different perspective, the data are too general to suggest immediately
practical uses. For example, instrumentation, i.e., tests, is of particular interest to
teachers and curriculum specialists, who are probably in the best position to
determine whether a match exists between what is taught and what is tested. The
1973 ETS survey focused on only four aspects of instrumentation: areas tested, tests
used, whether measures had been tailored or revised for state use, and who
developed tailored tests if they were used. The data were reported generally; specific tests
were not reported by state, and developers of tailor-made tests were not identified.

The 1973 survey information is also dated. Because of the tremendous growth in
testing during the past few years, a number of changes can be expected to have taken
place during the past seven years.

NEA survey
questions

For the reasons given above, NEA Research proceeded to update and extend the
1973 ETS survey and to focus primarily on the commercial aspects of
instrumentation. In particular, the NEA update was designed to answer three questions:

1. How many states currently conduct statewide testing programs?

2. In how many statewide testing programs does commercial involvement
exist?

3. Which tests and test developers are involved in statewide testing
programs?

Definition of Terms



For survey purposes, statewide testing program means one that applies to all public
school districts in the state. The program may be administered through the state
department of education or through a state university which may function as the
testing bureau for school districts. Testing may be required of all designated
students, may proceed on a random sampling basis, or may occur on a voluntary
basis.

Commercial involvement means programs where test content is determined wholly
or in part by publishers or by consultants whose services are purchased. Consultant
services means assistance with test design or test development and is restricted to
services involving test content such as developing test items or item pools, item
validation, or test construction. The term excludes services provided by a state
university where the client is the university's home state.



Survey Procedures 

State department of education officials were contacted by telephone during the
week of October 12-16, 1979. Individuals contacted are identified by state in
Appendix A. All telephone contacts were made by one person. No interviews were
recorded, and no systematic follow-up procedures were used to verify interview
content. Officials in the 50 states plus the District of Columbia and Puerto Rico
were included in the survey; therefore, the total number of states in the survey is
reported as 52.



Results of the Survey

The states currently conducting statewide testing programs are identified in Table 1.
According to the table, 9 states (17 percent of all identified states) reported having
no testing program; 43 states (83 percent) reported having a statewide testing
program. This represents an increase of 8 states (15 percent) with statewide testing
programs since the ETS survey in 1973.

Data describing the degree of commercial involvement in statewide testing
programs are also reported in Table 1. For survey purposes, the degree of involvement
included the categories of "None" (no commercial involvement), "Items" (use of
selected test items only), "Full Test" (use of one or several complete tests),
"Consultants" (use of commercial consultants), and "Full Test and Consultants." According to
the data, 3 of the 43 states with statewide testing programs (7 percent) reported no
commercial involvement. The remaining states with testing programs (40 or 93
percent) reported some commercial involvement. Two of the 43 states (5 percent)
used only selected test items; 12 states (28 percent) used full tests only; 13 states (30
percent) used consultants only; and 13 states (30 percent) used both full tests and
consultant services. Five of the 43 states (12 percent) offered but did not require the
use of specific tests or test items. Four states (9 percent) involved both commercial
and state consultants in their state testing programs. Because the 1973 ETS survey
did not collect commercial involvement data, it is not possible at this time to say
whether the 1979 data indicate increased commercial involvement in statewide
testing programs.
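
The percentages in the two paragraphs above follow directly from the reported counts. The sketch below recomputes them; the category labels come from the survey, while the function and variable names are illustrative assumptions.

```python
TOTAL_STATES = 52    # 50 states plus the District of Columbia and Puerto Rico
WITH_PROGRAM = 43    # states reporting a statewide testing program

# Degree of commercial involvement among the 43 states with programs.
INVOLVEMENT = {
    "None": 3,
    "Items": 2,
    "Full Test Only": 12,
    "Consultants Only": 13,
    "Full Test and Consultants": 13,
}


def pct(part, whole):
    """Share of `whole` represented by `part`, as a rounded percentage."""
    return round(100 * part / whole)


program_share = pct(WITH_PROGRAM, TOTAL_STATES)
no_program_share = pct(TOTAL_STATES - WITH_PROGRAM, TOTAL_STATES)

involvement_share = {
    label: pct(count, WITH_PROGRAM) for label, count in INVOLVEMENT.items()
}

# States with any commercial involvement, as a share of the 43 programs
# and as a share of all 52 states surveyed.
some_involvement = WITH_PROGRAM - INVOLVEMENT["None"]
some_share_of_programs = pct(some_involvement, WITH_PROGRAM)
some_share_of_all = pct(some_involvement, TOTAL_STATES)

print(program_share, no_program_share, involvement_share)
print(some_involvement, some_share_of_programs, some_share_of_all)
```

The five categories sum to the 43 states with programs. Expressed against all 52 states surveyed rather than the 43 with programs, the 40 states with some commercial involvement come to about 77 percent.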

Specific tests used and reported by state officials are identified in Table 2 by
publisher, test title, and state. According to the table, McGraw-Hill is the most
frequently used publisher by state programs. The company also offers the greatest
variety of tests as indicated by test title and official report. The data cannot be
construed to suggest the number of students taking each test nor the local and state
cost of administering the tests.

The consulting firms reported to be associated with statewide testing programs
appear in Table 3. The table identifies 17 consulting agencies and the states in which
services were rendered. Educational Testing Service (ETS) and National Evaluation
Systems (NES) are the most active consulting firms in statewide testing
programs, with ETS serving 7 states and NES serving 8 states. A summary of specific
consultant activity is reported by firm and state in Appendix B.

Summary and Conclusion 

The purpose of this survey was to update the instrumentation focus of the 1973
ETS survey of statewide testing programs and to extend that focus to include
commercial involvement. Based on the reports of 52 selected state officials,
survey findings indicate that most states (83 percent) have a statewide testing
program and that many of these states (77 percent) have some form of commercial
involvement. States tended to use either commercially prepared tests or the
assistance of consultants, although 11 states reported the use of both.

Based on these data, it is reasonable to conclude that considerable commercial
involvement exists in statewide testing programs. It is also reasonable to
conclude that a small number of publishers and consulting firms influence the
content of tests used in the various state programs.
Table 1. 1979 Summary of State Testing Programs
and Commercial Involvement

Columns: State; Testing Program (No, Yes); Degree of Commercial Involvement
(None, Items, Full Test Only, Consultants Only, Full Test and Consultants)

States listed: Alabama, Alaska, Arizona, Arkansas, California, Colorado,
Connecticut, Delaware, District of Columbia, Florida, Georgia, Hawaii, Idaho,
Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland,
Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska,
Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina,
North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Puerto Rico, Rhode Island,
South Carolina, South Dakota, Tennessee, Texas, Utah, Virginia, Vermont,
Washington, West Virginia, Wisconsin, Wyoming

(State-by-state entries are not legible in this copy; column totals follow.)

Total: Testing Program, No 9, Yes 43; None 3; Items 2; Full Test Only 12;
Consultants Only 13; Full Test and Consultants 13

*Tests or items offered but not required.
#Both commercial and state university consultants are used.
Table 2. 1979 Summary of State Testing Programs
by Publisher, Test, and State

Publisher and Test                                          State

• American College Testing Program
  1. Adult Performance Level Exam                           N.M.

• College Examination Board
  1. Degrees of Reading Power                               Conn., N.Y.
  2. Degrees of Writing Power                               N.Y.
  3. Preliminary Scholastic Aptitude Test (PSAT)            Minn.

• Harcourt, Brace, Jovanovich
  1. Metropolitan Readiness Test                            D.C.
  2. Metropolitan Achievement Tests                         Tenn.
  3. Otis-Lennon Mental Ability Test                        Hawaii
  4. Stanford Achievement Test                              Ariz., Hawaii, Nev.,
                                                            Tenn.*

• Houghton-Mifflin
  1. Cognitive Abilities Test                               Mo.*, W.Va.
  2. Iowa Test of Basic Skills                              Ga., Iowa*, Md., N.D.*
  3. Tests of Achievement and Proficiency                   Ga.
  4. Tests of Academic Progress                             Mo.*

• International Business Machines (IBM)
  1. SRA Achievement Series                                 N.D.*, Va.
  2. Iowa Tests of Educational Development                  Iowa*, Minn.*

• McGraw-Hill
  1. California Achievement Test                            Ala., Del., Ky., Md.,
                                                            Miss., N.C., Tenn.*,
                                                            Wash.
  2. Career Maturity Inventory                              Tenn.
  3. Comprehensive Test of Basic Skills                     D.C., Ky., N.M.,
                                                            S.C., Utah, W.Va., Wis.
  4. Diagnostic Math Inventory                              Ky., N.C.
  5. Everyday Skills Test                                   D.C.
  6. Prescriptive Reading Inventory                         Ky., N.C.
  7. Senior High Assessment of Reading
     Performance (SHARP)                                    N.C.
  8. Short Form Test of Academic Aptitude                   Ala., Miss.
  9. Test of Performance in Computational Skills
     (TOPICS)                                               N.C.

• Psychological Corporation
  1. Differential Aptitude Test                             D.C., Hawaii
  2. Test of Academic Skills                                Minn.*

• Teachers College Press (Columbia)
  1. Cognitive Skills Assessment Battery                    S.C.

*Tests are offered but not required.






Table 3. 1979 Summary of Commercial Involvement
by Consultant Firm and State

    Consultant Firm                                           State Client

 1. American College Testing Program (Washington, D.C.)       Nev.
 2. American Institutes for Research (Palo Alto, Calif.)      Mich.
 3. Bozler Educational Consultants (Lincoln, Neb.)            N.H.
 4. Educational Testing Service (Princeton, N.J.)             Ala., Ga., Minn., Nev.,
                                                              N.J., P.R., Tex.
 5. Institute for Behavioral Research and Creativity
    (Salt Lake City, Utah)                                    Utah
 6. Instructional Objectives Exchange (Los Angeles, Calif.)   Va.
 7. Intran (Minneapolis, Minn.)                               La.
 8. McGraw-Hill (New York, N.Y.)                              6,c.
 9. National Evaluation Systems (Amherst, Mass.)              Conn., Ga., Hawaii,
                                                              Md., Mass., N.J.,
                                                              R.I., Va.
10. National Testing Service (Durham, N.C.)                   Del., La.
11. Northwest Evaluation Association                          Wis.
12. Northwest Regional Laboratory (Portland, Oreg.)           Alaska, Idaho, Oreg.
13. Research Management Corporation (Portsmouth, N.H.)        N.H.
14. Research Triangle (Raleigh, N.C.)                         Ill., Maine
15. Scholastic Testing Service (Bensenville, Ill.)            N.C., Tenn., Va.
16. Science Research Associates,
    International Business Machines (Chicago, Ill.)           Mo.
17. Touchstone Applied Science Associates
    (Elmsford, N.Y.)                                          Conn., N.Y.



FOOTNOTES



1 Educational Testing Service. "State Testing Programs: A Survey of Functions,
Tests, Materials, and Services." Princeton, N.J.: The Service, 1968. (TM 003 001)

2 Educational Testing Service. "State Testing Programs: 1973 Revision."
Princeton, N.J.: The Service, 1973. (TM 003 397)



SECTION IV: NEA POSITION ON TESTING 



Historical Background 

In February 1973 the National Education Association Center for Human Relations
held a national conference in Washington, D.C. The theme of the three-day 
conference was "Tests and Use of Tests— Violations of Human and Civil Rights." 
The objectives of the conference were: 

• To examine current attitudes about the educational value of standardized
  tests, especially as they affect the culturally different learner.

• To explore alternative measurement and evaluation processes that would
  be helpful tools in the education process.

• To create greater national awareness of the need for concerted action to
  prohibit the use of test scores as indicators of growth potential, especially
  for the culturally different learner.

One major outcome of the conference was a recommendation to the NEA
Representative Assembly concerning standardized tests. Meeting that summer, the
Assembly adopted Resolution 72-44 on "Standardized Testing." The Resolution
encouraged the elimination of standardized group tests of intelligence, aptitude,
and achievement until a critical appraisal, review, and revision of current
testing programs had been conducted. 1 Known as the NEA moratorium on testing,
the Resolution remained in effect until 1978 when it was revised.

Several events during the years following the moratorium suggested a need to
reexamine NEA testing policy. The 1972 moratorium had successfully alerted the
public to the dissatisfaction of many educators with existing tests and testing
practices. The dissatisfaction, however, needed elaboration, especially in light
of widespread and often uncritical acceptance of standardized test scores,
misinterpreted test results, and increased demand for testing. There was, too,
professional recognition that some tests, if carefully constructed, could be
instructionally useful.

On Further Examination and NEA Response 

One event in particular precipitated an Association response. The event was the
release in 1977 of On Further Examination, published by the College Examination
Board (CEB). The study was conducted by an advisory panel of 21 people chaired
by former Secretary of Labor Willard Wirtz. The purpose of the study was to
investigate declining Scholastic Aptitude Test (SAT) scores among high school
students. The study was sponsored and funded by CEB and Educational Testing
Service (ETS). CEB sponsors the SAT; ETS develops and administers it.

On Further Examination was the study of the 14-year decline in SAT scores from
1963 to 1977. Verbal scores had dropped 49 points, from 478 in 1963 to 429 in 1977.
During this same period, mathematics scores had dropped 32 points, from 502 to
470. The question posed to the Wirtz panel was, why?



The report represented a comprehensive analysis of social and educational change
believed to be reflected in test scores. The change and resulting test score
decline were proposed to have occurred in two stages, each characterized by
different causal factors. The first premise explained score changes from 1963 to
1969 as the result of a changing population of students taking the SAT. The
population purportedly included "larger proportions of students of
characteristically lower-scoring groups of students." 2 The second premise
attributed the decline from 1970 to 1977 to various social and educational
factors. 3

NEA prepared three responses to the Wirtz report. The first response appeared as
an editorial by NEA President John Ryor in the November-December 1977 issue of
Today's Education. Ryor acknowledged in the editorial the panel's effort to be
fair, to demonstrate some understanding of the different tasks of teachers, and
to express awareness of some of the criteria guiding the use of SAT scores. Ryor
concluded, however, that the report could provoke a misuse of test data by
legislatures who would see only declining test scores and ignore the caution
against imposing upon the schools more rigidity and uniformity. 4 Ryor's primary
objection was that the panel examined test results, not the test itself. Thus,
the fundamental and unanswered question was: "Should a SAT test which hasn't
changed significantly in 36 years be allowed to become a major determinant of
school curriculum?" 5

The second response was a booklet entitled On Further Examination of 'On
Further Examination' published in 1977 by NEA Instruction and Professional
Development. This publication commended the Wirtz panel for its lack of
indictment, attention to multiple rather than single questions and theories, and
consultation with some experts. Nevertheless, the publication argued that the
examination of declining SAT scores was unfinished. The paper analyzed panel
comments about the SAT, teaching, and selected aspects of society. The paper
noted that panel members carefully avoided an analysis of the test itself and
that further examination of the declining test scores should address many more
issues such as questions of validity (particularly predictive validity) and
cultural bias, the assumption that educational content and performance standards
remain unchanged over time, and whether SAT could or should measure such skills
as thoughtful and critical reading and careful writing. 6

The third response was prepared by the NEA Special Committee on Declining SAT
Scores. Appointed in 1977 by John Ryor, the five-member committee was charged
with three tasks:

• To analyze On Further Examination

• To review NEA's current policy on testing

• To develop a set of policy recommendations.

In 1978 the NEA committee submitted the results of its investigation entitled
NEA's Analysis of the Wirtz Report on Declining SAT Scores. (A copy of the
executive summary of this report appears in Appendix C.) Among conclusions
reached were:

• The conclusions in the Wirtz report exceed those that can be reasonably
  drawn from the provided descriptive statistics. 7

• The SAT has been constructed to ensure test reliability at the expense of
  test validity. 8

• Item selection is based more on the power of items to differentiate among
  students than on the match of items with instructional content. 9

• The value of the SAT is questionable when questions of validity are
  addressed. 10

• The two premises used to explain the SAT score decline lack objective
  documentation and are not generalizable. 11

• There is no evidence to support the view that students learn less today
  than their counterparts in the past. 12

The remaining charges of the NEA Special Committee were to review existing
testing policy and prepare policy recommendations. Based on its review of
existing policy, the Committee concluded that several policy changes were
desirable. The Committee believed that some tests could be instructionally
useful to teachers and that such tests should be supported. The Committee also
believed that many tests were inappropriate for educational measurement and
evaluation and that steps should be taken to help teachers become better
informed about the meaning of tests and the use of test data. 13

Recommendations of the Special Committee were presented to members of the
1978 Representative Assembly meeting in Dallas, Texas. In response to the
recommendations, the Assembly revised the 1972 Resolution. The new Resolution
78-82 on "Standardized Testing" recognized that student testing could serve
important educational purposes such as diagnosing learning needs, prescribing
instructional activities, and measuring student progress in curriculum
content. 14 The Resolution supported the use of tests prepared or selected by
the teacher and made explicit NEA opposition to standardized tests which are:

• Damaging to a student's self-concept and contribute to the self-fulfilling
  prophecy whereby a student's achievement tends to fulfill the negative
  expectations held by others.

• Biased against those who are economically disadvantaged or who are
  culturally and linguistically different.

• Used for tracking students.

• Invalid, unreliable, out-of-date, and restricted to the measurement of
  cognitive skills.

• Used as a basis for the allocation of federal, state, or local funds.

• Used by book publishers and testing companies to promote their financial
  interests rather than to improve measurement and instruction.

• Used by the media as a basis for invidious public comparisons of student
  achievement test scores.

• Used to test performance levels as a criterion for high school graduation.

• Used to evaluate teachers.

The Wirtz report was an occasion for the Association to explain why it opposed
so many tests and testing practices. Recent proposals for truth-in-testing
legislation have provided opportunities to describe testing changes the
Association advocates. (See Appendix D for the 1980 Resolutions which further
elaborate the NEA position on testing.)






Two Testing Changes NEA Advocates



On August 1, 1979, and again on October 10, 1979, the NEA testified before the
Subcommittee on Elementary, Secondary, and Vocational Education. The Committee
was considering two proposals for truth-in-testing legislation. Both proposals,
whose contents are discussed in Section V, concerned truth and disclosure
legislation. During testimony the Association expressed the belief that certain
changes in testing could improve tests and the way they are used. A copy of the
NEA analysis of the federal proposals appears in Appendix E. The changes,
elaborated for this publication, are discussed below.






Criterion-Referenced Tests

One change already occurring but on a limited scale is the use of
criterion-referenced tests. Criterion-referenced tests are those in which
individual performance is described in terms of specific instructional content
or performance objectives rather than in terms of the performance of others.

The popular notion of criterion-referenced testing may have come from the
distinction made by Robert Glaser in 1963. Concerned about the failure to use
tests for instructional purposes, Glaser appealed for tests that could be
interpreted directly in terms of defined educational content. He distinguished
between test scores whose interpretation indicated what an individual could
actually do and test scores whose interpretation indicated what an individual
could do when compared to others. The former were criterion-referenced tests;
the latter, norm-referenced. 15



Criterion-referenced tests are an alternative approach to traditional tests.
They have several characteristics which make them instructionally useful. If
well designed and carefully constructed, criterion measures describe with
considerable clarity the specific knowledge and skill measured. Thus, teachers
can select or develop tests better matched with actual instruction and
educational objectives. The measures will be more accurate, the quality of test
data will be improved, and the information about achievement and progress will
better serve the goals of instruction.



Criterion-referenced tests are designed to describe performance relative to
instructional content. Measures that succeed in this respect can be expected to
make more sense to students and teachers. Success and error can be more readily
understood in terms of specifics rather than vague abstractions. The
interpretation of test scores in terms of specific content and skill also makes
more manageable the task of understanding error.

Criterion testing will not end the current practice of norm referencing. The
belief has somehow emerged that some tests are criterion referenced and others
are norm referenced but that a test cannot be both. In fact, a test score can be
interpreted both ways provided test content is precisely described.

Nor will criterion testing be problem free. One problem is the difficulty of
achieving descriptive clarity of the content and behaviors to be measured.
Various frameworks have been developed to help promote descriptive clarity.
Among available frameworks are various theoretical constructs such as cognitive
and affective domains and structure-of-intellect models; instructional,
behavioral, or performance objectives; content-processing matrices; and formal
rules for item development.

A second problem is the difficulty of attributing meaning to test scores.
Criterion scores have been expressed as expectancies, predictors, diagnostic
signs, and indicators of mastery. The terms imply a performance standard or
cutoff point. G.V. Glass has argued that existing methods of determining
criterion scores are arbitrary and that interpretations based on absolute
standards are meaningless given existing knowledge. 16 Glass asserts that "the
only sensible interpretation of data from assessment programs will be based
solely on whether the rate of performance goes up or down." 17 If this is the
case, then new interpretive guidelines will be necessary if indicators of the
direction and rate of change are to make instructional sense.

Buros Reform Proposal

Criterion testing is one change encouraged by the NEA. A second change that
holds promise for educational measurement is embedded in Oscar Buros' proposal
for test reform. Buros favored tests built for the purpose of measurement rather
than differentiation. 18 To achieve this end, he proposed developing different
tests to measure the achievement of groups and the achievement of individuals.
Group tests would be used to measure groups such as schools or school systems
with common objectives and learning environments. Individual tests would be used
to measure individuals. Group tests could cover both common objectives and
objectives unique to a school or school system and could be administered to a
sample of students. Individual tests would cover those objectives unique to
local objectives for which measures of specific individual growth would be
desired.

Buros believed methods of reporting test data could be simplified. He advocated
local rather than national norms, raw score means, and frequency distributions
of raw score means calculated for item scores and total test scores. He also
believed that individual scores could be more meaningfully reported if the raw
score were reported as a percentage of the possible total score and also if
percentile rank within grade were reported. 19 For example, a descriptive record
of 80/65 for a given student would indicate that within a given grade that
student successfully answered 80 percent of the items and scored as well or
better than 65 percent of the other students locally.
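
The two-part record described above can be illustrated with a short
calculation. The sketch below is written in Python and is only an
illustration: the function name, the sample scores, and the rounding to whole
percentages are assumptions made here, not details taken from Buros' proposal.
The second figure is computed as the percentage of other local students in the
same grade whose raw score is at or below the student's score.

```python
def descriptive_record(raw_score, possible_score, other_local_scores):
    """Build a 'percent correct / local percentile rank' record.

    raw_score          -- number of items the student answered correctly
    possible_score     -- total number of items on the test
    other_local_scores -- raw scores of the other students in the same
                          grade locally (the student is not included)
    """
    # First figure: raw score as a percentage of the possible total score.
    percent_correct = round(100 * raw_score / possible_score)

    # Second figure: share of other local students the student scored
    # as well as or better than.
    at_or_below = sum(1 for score in other_local_scores if score <= raw_score)
    percentile_rank = round(100 * at_or_below / len(other_local_scores))

    return f"{percent_correct}/{percentile_rank}"


# Hypothetical data: a 50-item test and 20 other students in the grade,
# 13 of whom scored at or below the student's raw score of 40.
others = [30] * 13 + [45] * 7
record = descriptive_record(40, 50, others)
print(record)
```

With these hypothetical figures the student answered 40 of 50 items correctly
and scored as well as or better than 13 of the 20 other students, which gives
the record 80/65 used in the example.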

Example of Feasibility of Some Advocated Changes 

The Association favors the use of criterion measures for both group and
individual tests, and it favors reporting data in more usable forms. The
feasibility of accomplishing this for groups on a large-scale basis has already
been demonstrated by the National Assessment of Education Progress (NAEP). NAEP
is an example of criterion-referenced testing. The broad purpose of NAEP is to
measure the nation's educational progress, and the function of the various test
exercises is to describe achievement in terms of educational content and
specific instructional and behavioral goals. Exercises are statistically sorted
into booklets, booklets are administered to individuals selected to represent
significant characteristics such as age and geographical region, and test data
are reported by subject area, age group, and instructional content.

There are many features that distinguish NAEP testing from standardized
achievement tests. Test exercises are developed to measure educational
objectives consistent with instruction. The selection of test items is based on
their match with instructional content rather than on their power to
discriminate among students. Sampling procedures allow for the assessment of
many cognitive and affective objectives without subjecting students to lengthy
test sessions. Results of the data are also easily understood by professional
and lay audiences.

There is an additional feature of the NAEP program worth noting. NAEP is
governed by a relatively open testing policy. That is, the theoretical and
practical aspects of test development are richly documented and accessible.
Furthermore, reported data are accompanied by actual test items and their
correct answers (up to half of all NAEP items are released after test
administration). Thus, one knows the objective measured, the instrument of
measure, and the results which are reported by objective and item and are
portrayed in various ways. This disclosure has responsibly informed people about
the test, and it has also provided educators with information and ideas that are
instructionally useful. (See Appendix F for NEA's letter of support to the
Education Commission of the States regarding the National Assessment of
Education Progress.)






Other Changes NEA Supports 

NEA supports the idea of open tests and believes that the release of all test
items and their answers will be a significant change in educational measurement.
The Association respects the idea of secure test items prior to test
administration provided there is reason to believe they are well designed, well
constructed, theoretically sound, and instructionally relevant. After test
administration, NEA believes students have a right to inspect their own
performance and to have the opportunity to learn from their successes and
errors.

The Association supports a number of other efforts to improve testing in the
United States. Among such efforts are local test development; construction of a
variety of measures including observation and student self-reports; sequential
testing where items and tests are tailored for individuals; item banks with
items classified, stored, and retrieved according to specified content, format,
and difficulty; and computer-generated tests constructed to meet certain
specifications.
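
The idea of an item bank can be made concrete with a small sketch. The Python
example below is a hypothetical illustration only: the class names, the sample
items, and the use of the proportion of students answering correctly as the
difficulty index are assumptions introduced here, not a description of any
existing item bank. It shows items being classified and stored, then retrieved
by content, format, and difficulty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    content: str       # content area, e.g. "reading"
    fmt: str           # item format, e.g. "multiple-choice"
    difficulty: float  # assumed index: proportion answering correctly (0 to 1)


class ItemBank:
    """A minimal in-memory item bank (hypothetical design)."""

    def __init__(self):
        self._items = []

    def add(self, item):
        """Store one classified item."""
        self._items.append(item)

    def retrieve(self, content=None, fmt=None,
                 min_difficulty=0.0, max_difficulty=1.0):
        """Return stored items matching the requested specifications."""
        return [
            item for item in self._items
            if (content is None or item.content == content)
            and (fmt is None or item.fmt == fmt)
            and min_difficulty <= item.difficulty <= max_difficulty
        ]


bank = ItemBank()
bank.add(Item("R-001", "reading", "multiple-choice", 0.82))
bank.add(Item("R-002", "reading", "multiple-choice", 0.45))
bank.add(Item("R-003", "reading", "short-answer", 0.60))
bank.add(Item("M-001", "mathematics", "multiple-choice", 0.70))

# Retrieve reading items in multiple-choice format that at least half of
# students are assumed to answer correctly.
selected = bank.retrieve(content="reading", fmt="multiple-choice",
                         min_difficulty=0.5)
selected_ids = [item.item_id for item in selected]
print(selected_ids)
```

A computer-generated test built to certain specifications could, under the same
assumptions, be assembled by calling `retrieve` once for each content area and
difficulty range the specifications require.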

For over a decade NEA has advocated change in the way testing is viewed and
practiced in the United States. The Association believes that change will
constructively occur when testing and instruction aim toward the same objectives
and are designed for the same purpose of providing the best education possible
for all individuals.
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SECTION V: TRUTH-IN-TESTING LEGISLATION 






In 1978 the California legislature examined a proposal concerning information
about tests. The proposal required test publishers to disclose to the California
Postsecondary Education Commission descriptive information about test content,
test validity, standards, administration, expenses incurred, and income. It also
required publishers to provide to test takers descriptive information about test
content, test purpose and use, treatment of scores, and score ownership. The
proposal applied only to standardized tests administered to 3,000 or more
students for the purpose of postsecondary admissions selection. The legislation
was enacted in September 1978 and became the first truth-in-testing law in the
United States.

In 1979 similar legislation was enacted in New York. The New York law applied to
tests used for postsecondary and professional school admission selection and
specifically excluded civil service exams and tests used for other purposes. It
required the disclosure of similar kinds of descriptive information required in
California. Unlike California, which required disclosure only of test questions
equivalent to those actually used, the New York law required full disclosure of
test items actually used. It was the full disclosure clause which made the New
York legislation controversial, even after it was enacted in July 1979.

Similar proposals in other states (Florida, Ohio, and Pennsylvania, for example)
and two at the federal level concern truth-in-testing. None of these proposals
has yet been enacted, but others will undoubtedly be proposed and eventually
made into law as the movement gains momentum.

Current truth-in-testing proposals and laws are aimed at standardized tests and
represent notice and disclosure legislation. As they are currently
conceptualized, the proposals have been viewed as a variation of consumer
protection legislation. The legislation recognizes the right of consumers to be
informed about the products and services they purchase. Consumers of testing
include students who are tested and who often pay test fees, educational
organizations such as the American Medical Association for whom special tests
are developed, and the states with constitutional responsibility for public
education.

A number of issues are involved in truth-in-testing debates. Some of the issues,
although important, do not address the legislation directly. Some of these
issues, identified in Searching for the Truth About Truth-in-Testing Legislation
published by the Education Commission of the States, involve undifferentiated
discussion of tests, undifferentiated discussion of the information needs of
various individuals and groups in education, and narrow focus on certain kinds
of test performance. 1 These are issues where testing opponents and proponents
tend to talk past each other rather than tackle the issue directly. As already
mentioned, these issues are tangential to most legislative proposals.



Issues that tend to bear directly on the legislation revolve around five major
subjects: the need for tests, test publishers, test quality, the need for
testing legislation, and the consequences of testing legislation. The arguments
on both sides of each issue are summarized below.






The Need for Tests 



The kinds of tests under consideration are measures of achievement and aptitude.
Their use is restricted usually to postsecondary admissions selection.
Proponents of current practices point to the large numbers of students
attempting to gain admission to postsecondary schools and the need for
information for selection purposes. With limited budgets, space, and curricular
programs, institutions need information to help them select those students best
qualified and most likely to complete successfully a course of study. Test
scores can supply this information more objectively than other sources. It is
also argued that tests can help students self-select postsecondary schools
consistent with their own abilities and preferences.

Opponents argue that with college enrollments dropping and universities in need of 
revenue, the need to select and reject certain students has diminished. The tests 
systematically penalize certain groups of students and function more effectively as 
instruments to maintain the status quo. The test results adversely affect the educa- 
tional aspirations and employment opportunities of many individuals and should 
not be used in any way in publicly supported institutions or in institutions that
compete for and accept federal tax dollars. 

Test Publishers

The test publishers in question are those who produce standardized tests used
for postsecondary admissions selection. The exemplar chosen is often Educational
Testing Service (ETS), producer of one of the more common tests, the SAT, used
for selection purposes. The central issue here is responsibility, or
accountability, as it is called in the public sphere.

Opponents of testing legislation argue that test publishers are accountable. In
the marketplace they are one of numerous competitive testing companies with
comparable financial resources. Therefore, they cannot be regarded as a
monopoly. Test publishers make their products and services available to
institutions which are not forced to use them but rather exercise free choice in
test selection and test use. Test publishers are accountable to clients who
design testing programs and request special tests for program purposes. Test
publishers are also accountable to the public under whose laws they are
regulated and whose educational members have access to many reports prepared
regularly for their benefit.

Proponents of the legislation argue that publishers are in the business of
measuring minds and so exercise great influence over what to think and how to
think. Claims made for the power of the tests and for the science of their
measurement lack convincing evidence, but the claims are nevertheless repeatedly
made. The largest numbers of people who actually purchase and "use" tests and
testing services are students who have inadequate knowledge of the nature of the
measure to which they submit and the use that will be made of the data they
individually provide.

Test Quality

Opponents of testing legislation argue that tests are theoretically and
technically sound, given existing knowledge, and reflect social and educational
values associated with intellectual development and cognitive power. Opponents
do not claim that tests are designed for comprehensive personal, social, or
intellectual measurement; nor do they claim that existing tests can assess such
qualities as creativity, imagination, and persistence. Specific item weaknesses
have been acknowledged but often with the defense that items undergo an
extensive review and revision process and that efforts are made constantly to
improve test content. Technically speaking, opponents agree that test validity
is difficult to achieve, but opponents argue that efforts are made continually
to gather validity data. They also argue that tests do what they were designed
to do, and nothing more. They were not designed to predict with perfect accuracy
the future of individuals. They were designed rather to improve short-term
predictions about people; and this, it is claimed, they generally do.

Proponents of testing legislation challenge current theoretical models of intelligence
or innate capacity. What the tests measure, they say, are skills and content
that can be taught. The use of these tests consequently influences what is taught,
what is learned, and what is thought. Tests also fail to capture the range of human
qualities that are involved in various human endeavors such as pursuing a course of
study and working toward an academic degree. Technical arguments by proponents
frequently involve criticism of specific test items as a way to illustrate a range of
problems with the test such as cultural bias, ambiguity, and over-simplified logic.
Technical quality is challenged particularly with respect to predictive validity, which
opponents of the tests say lacks convincing evidence and does not improve upon
existing predictors such as grade point average.



Conceptual and 
technical arguments 



The Need for Testing Legislation 

Opponents of testing legislation argue that the need for legislation has not been
demonstrated. They reject the logic behind arguments that test producers and tests
control or adversely influence educational content and ways of thinking. They
refute arguments for more information by noting the amount and kind of information
already provided and make the case for secure testing in the name of quality
control. Government regulation, they argue, is unnecessary and in the case of
federal regulation violates states' rights to control education. They also argue that
such regulation is obtrusive and unconstitutional intervention.

Proponents of testing legislation argue that more information about tests is necessary
if tests are to be wisely chosen and judiciously used. This information can be
supplied by test publishers, who have steadfastly refused to release it. They argue
that institutions are bound by various state and federal regulations that require
them to meet certain standards and achieve certain aims. The federal government
has been involved in education since military academies were established in the
eighteenth century but particularly in the post World War II period. They also
argue from an analogy between test materials and services and consumer products
now under federal regulation. They argue that the consumer has a right to be fully
informed about the nature of the product or service he or she purchases whether the
product is a hair dryer, automobile, or test.



The question of 
information control 



Consequences of Legislation

The consequences of testing legislation are, from the proponents' point of view,
largely positive. Legislation will force producers to be accountable to test users and
test takers, will result in the dissemination of quality information, will open tests to
the scrutiny of many people including professional educators and researchers, and
will result ultimately in improved tests and improved use of test information.

The consequences of testing legislation are viewed less optimistically by opponents.
Test producers argue that proposed changes, particularly full disclosure clauses,
will increase the cost of test production. These costs in turn will be passed on to
students who will pay more for each test; poor students will be affected most. Test
producers believe testing legislation will adversely affect test quality and will lead to
the withdrawal of some tests in states with testing controls. Ultimately decision
makers will be forced to rely on less accurate information and, therefore, to make
arbitrary decisions about individuals.



Open versus Secure Testing 

By far the most explosive issue in truth-in-testing legislation is the full disclosure
clause which mandates the release of actual test questions and answers soon after
the test has been given. Open testing means test disclosure. Secure testing means no
test disclosure even after the test has been administered. The issue of open versus
secure testing involves test information and its accessibility.



Disclosure clauses in truth-in-testing legislation would open tests after administration
to public scrutiny. Arguments in support of open testing appeal primarily to
the test taker's right to be informed and the test producer's obligation to provide
that information.

The right to be informed is sometimes treated as a right in itself or as a matter of
ethics. 2 When an individual is tested, it is argued that he or she has the right to know
the results of the examination, the meaning attributed to the results, and the
original data. Usually the discussion of rights shifts to decision making where test
results are involved in decisions such as college admission that affect the test taker.
With more at stake, the test taker has the right to examine and judge the kind of test
data he or she provides for decision-making purposes. Most often the right-to-know
argument is expressed as a matter of fairness where personal feelings are set
aside in an effort to achieve a balance of conflicting interest. If the test taker must
submit to testing and accept the results, then fairness involves the opportunity to be
fully informed of the data and the standards of judgment.

Because tests function as instruments of social policy, test producers have a responsibility
to inform test takers and the public about the instruments provided. This is
an accountability argument, and it is appropriate in the public sector. This argument
affirms the belief that those entrusted with public institutions must be accountable
to the public which supports them. One aspect of this accountability is to
increase information about the instruments used to decide who will and who will
not attend public institutions.



Arguments to support secure testing involve test quality, controlled costs, and
constitutional questions. Secure testing is believed to be a necessary condition for
test quality. One technical characteristic of quality is test validity. Test validation is
a process of providing evidence that the test measures what it is supposed to
measure. In cases where multiple forms of each test are developed each year and for
successive years, some effort must be made to make the various test forms within a
given year and across successive years equally valid. The procedure for establishing
the equality of multiple tests across time is called equating. It involves reusing test
items in successive test administrations. Open testing would require that test content
be disclosed sometime after test administration. This disclosure would damage test
validity by exposing those items intended for reuse. Thus, it would end current
equating practices. Other concerns with test validity involve those subject areas for
which a limited number of test items exist. Open testing would eventually expose all
items and would increasingly erode test validity. The end result would be a diminished
confidence in tests whose quality and usefulness would be eroded through
exposure.
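
The role of reused items in equating can be illustrated with a minimal sketch. It assumes chained linear equating, one textbook approach chosen here only for illustration; the text does not say which procedure any publisher actually uses. All scores, names, and functions below are hypothetical.

```python
from statistics import mean, pstdev

def linear_link(scores_from, scores_to):
    """Build a function that maps one score scale onto another by
    matching the means and standard deviations observed in one group."""
    m_from, s_from = mean(scores_from), pstdev(scores_from)
    m_to, s_to = mean(scores_to), pstdev(scores_to)
    return lambda score: m_to + (s_to / s_from) * (score - m_from)

def chained_linear_equating(new_total, new_anchor, old_anchor, old_total):
    """Map scores on a new form onto the scale of an old form by way of
    the reused (anchor) items that appear on both forms."""
    to_anchor = linear_link(new_total, new_anchor)  # new form -> anchor items
    to_old = linear_link(old_anchor, old_total)     # anchor items -> old form
    return lambda score: to_old(to_anchor(score))

# Hypothetical data: each group took one form plus the same reused items.
old_total = [40, 50, 60, 70, 80]   # group that took the old form
old_anchor = [8, 10, 12, 14, 16]   # same group, reused items only
new_total = [35, 45, 55, 65, 75]   # group that took the new form
new_anchor = [8, 10, 12, 14, 16]   # same group, reused items only

equate = chained_linear_equating(new_total, new_anchor, old_anchor, old_total)
for raw in (45, 55, 65):
    print(f"new-form score {raw} -> old-form scale {equate(raw):.1f}")
```

In this sketch the two groups perform identically on the reused items, so the five-point gap in total scores is attributed to the new form being harder. The link depends entirely on the reused items behaving the same way in both administrations, which is the condition that secure-testing advocates say disclosure would undermine.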

Open testing requires that new test forms be continually developed for each test
administration and that new methods of equating the forms be developed. The
process of research and development needed to achieve this would be expensive.
These costs would eventually be passed on to test takers. Thus, the legislation would
force costs upward and would affect everyone. State and federal regulation would
inflate the cost of testing.

Open testing is also viewed as unconstitutional. One claim is that open testing
infringes on First Amendment rights interpreted in this argument as an institutional
right to decide who will be admitted to college and also as an individual researcher's
right to determine whether or not her or his research will be made public. The latter
enters into the debate because some research on testing is conducted by private
individuals who have no financial relationship with test publishers but whose
research helps establish various technical qualities of tests.

A second claim invokes the Fifth and Fourteenth Amendments. The Fifth Amendment
prohibits the federal government from depriving any person of life, liberty, or
property without due process of law. The Fourteenth Amendment extends the
provisions of the Fifth Amendment to include state governments. Private property
in testing legislation refers to tests and related test data. The claim of private
property is strengthened by test copyrights which bring tests under the protection of
the Federal Copyright Act of 1976. 3 Given existing law, the disclosure clause would
deprive test producers of exclusive rights to their tests and would in effect destroy
their value for future use.

The issue of open versus closed testing is complex. It has attracted considerable
attention from various groups and individuals, and it is likely to persist. For those
interested in following the debate nationally and within their respective states, two
well-documented and reasoned publications are worth study. One paper is The
Debate Over Open Versus Secure Testing: A Critical Review written by Andrew
Strenio, Jr. 4 Prepared for the National Consortium on Testing, the paper examines
the case for testing legislation, the case for perpetuating existing test practices, and
the strengths and weaknesses of the arguments. The second publication was prepared
by the Education Commission of the States and is entitled Searching for the
Truth about Truth-in-Testing Legislation. 5 Prepared for legislators, the report pays
close attention to legislative arguments, existing law, and legal implications of
testing legislation.



The problem of cost 




Constitutional 
considerations 



The NEA Position on Truth-in-Testing Legislation

In June 1979 the NEA Representative Assembly voted to urge a congressional
investigation of the standardized testing industry, the tax-exempt status of testing
companies, and the need for truth-in-testing legislation. In August and again in
October 1979, the Association presented testimony on two federal truth-in-testing
legislation proposals being studied by the Subcommittee on Elementary, Secondary,
and Vocational Education. The proposals, the Truth-in-Testing Act of 1979
(H.R. 3564) and the Educational Testing Act of 1979 (H.R. 4949), both represented
notice and disclosure legislation which the Association supported. (See Appendix E.)



NEA supports
truth-in-testing
legislation

The Association favors truth-in-testing legislation. The legislation represents an
effort to promote public accountability of the testing industry and also of the
schools. The legislation will make possible access to information necessary for
responsible test selection and use. The legislation will further the aim of needed test
reform. Above all, truth-in-testing legislation will provide information to the
people who can benefit most from open testing and full disclosure: students whose
intellectual growth and development can be enhanced by personal knowledge of
their measured achievement and whose preparation for college and career entry can
benefit from quality test data timely provided.

Appendix A
SURVEY PARTICIPANTS FOR COMMERCIAL INVOLVEMENT IN STATEWIDE TESTING PROGRAMS

1979 Survey Participants



I. OFFICIALS OF STATE BOARDS OF EDUCATION

Alabama:                Clinton Owens          (205) 832-3402
Alaska:                 Ernest Polley          (907) 465-2967
Arizona:                Steve Stevens          (602) 255-5837
Arkansas:               James Washburn
                        Connie Darden          (501) 371-1464
California:             Dale Carlson           445-4338
Colorado:               James Hennes           (308) 839-2111
Connecticut:            Douglas Rendone        (203) 566-8250
                        George Kinkaide        (203) 566-7232
Delaware:               Robert Bigelow         (302) 678-4583
District of Columbia:   Robert Farr            (202) 724-4164
Florida:                Thomas Fisher          (904) 488-8198
Georgia:                Elizabeth Creech       (404) 656-2661
Hawaii:                 Selvin Chin-Chance     (808) 656-2661
Idaho:                  Karen Undertook        (208) 384-2113
Illinois:               John Alford            (217) 782-4984
Indiana:                Ronald Hartman         (317) 927-0241
Iowa:                   Max Morrison           (515) 281-5274
Kansas:                 Judy Hamilton          (913) 296-3201
Kentucky:               Armand Discontini      (502) 564-4394
Louisiana:              Hugh Peck              (504) 342-3750
Maine:                  Betty McLaughlin       (207) 289-2033
Maryland:               William Qn^nt          (301) 796-8300 Ext 328
Massachusetts:          Mathew Towle           (617) 727-0190
Michigan:               Edward Roeber          (517) 373-8393
Minnesota:              William McMillan       (612) 296-6002
Mississippi:            Rex Pouncey            (601) 354-6979
Missouri:               Charles Foster         (314) 751-3545
Montana:                Bill Connett           (406) 449-3693
Nebraska:               Harriet Egerson        (402) 471-2444
Nevada:                 George Barnes          (702) 885-5700
New Hampshire:          James Carr             (603) 271-3740
New Jersey:             Carl Johnson           (609) 292-4450
New Mexico:             Bayla Nochumson        (505) 827-2282
New York:               Windsor Lott           (518) 474-5099
North Carolina:         Robert Evans           (919) 733-3813
North Dakota:           Hank Landes            (701) 224-2391
Ohio:                   Ken Higgins            (614) 466-4868
Oklahoma:               James Casey            (405) 521^196
Oregon:                 Susan Holmes           (503) 378-3583
Pennsylvania:           Robert Coldiron        (717) 787-4234
Puerto Rico:            Edith Vasquez          (809) 754-0964
Rhode Island:           Martha Highsmith       (401) 277-3126
South Carolina:         Terry Helsley          (803) 758-8610
South Dakota:           Robert Huckins         (605) 773-3371
Tennessee:              Jesse Warren           (615) 741-1099
Texas:                  Keith Cruse            (512) 475-2066
Utah:                   Dave Nelson            (801) 533-5461
Vermont:                Karlene Russell        (802) 828-3111
Virginia:               Richard Boyer          (804) 786-2624
Washington:             Gordon Ensign          (206) 753-3449
West Virginia:          Doris White            (304) 348-3230
Wisconsin:              James Gold             (608) 266-3390
Wyoming:                Lynn Simons            (307) 777-7673

II. OTHER PARTIES CONTACTED

1. EDUCATION COMMISSION OF THE STATES: Jack Schmidt (303) 861-4917

2. SCHOLASTIC TESTING SERVICE: John Kauffman (313) 665-0089

3. TOUCHSTONE APPLIED SCIENCE ASSOCIATES: Dr. Bertram Koslin (914) 592-2630

Appendix B
SUMMARY OF CONSULTANT ACTIVITY BY FIRM AND STATE

1. AMERICAN COLLEGE TESTING PROGRAM

Nevada: ACT assisted in establishing the validity of items for the
Nevada Competency Test Program. This test is currently
given in the ninth grade and eventually will be given in the
twelfth grade.

2. AMERICAN INSTITUTES FOR RESEARCH

Michigan: AIR assisted in developing the tests used in the Michigan
Educational Assessment Program. Under the program
tests are now given in grades 4, 7, and 10.

3. BOZLER EDUCATIONAL CONSULTANTS

New Hampshire: BEC is assisting in field testing and report writing for the
New Hampshire Educational Assessment Program.

4. EDUCATIONAL TESTING SERVICE

Alabama: ETS consulted on the validity of a state competency test
piloted in 1979. The test will eventually be given in grades
3, 6, and 9.

Georgia: ETS advised on the development of the Georgia Criterion
Reference Tests. These include tests in reading,
mathematics, and career development in grades 4, 6, and 8
and a tenth grade test in mathematics and communications
skills. The current contractor is the University of
Georgia.

Minnesota: ETS developed the PSAT and the SAT tests, which are
offered to school districts through the University of Minnesota's
Student Counseling Bureau.

Nevada: ETS advised on the procedure for writing test items used
in the Nevada Competency Test Program. This test is
currently given in the ninth grade and will eventually be
given in the twelfth grade.

New Jersey: ETS assisted in item development for the New Jersey
Minimum Basic Skills Tests. These competency tests in
reading and mathematics are given in grades 3, 6, 9, and
11. The current contractor for new items is NES.

Puerto Rico: ETS consults on a continuing basis regarding the validation
and interpretation of results for Pruebas de Destrezas
Básicas (Tests of Basic Skills). These include achievement
tests in mathematics and Spanish reading in grades 2 and
3, plus tests in English given in grades 4, 5, and 9. ETS
plays a similar role with respect to the Prueba de Habilidad
General, which is given in grades 4, 7, and 10.

Texas: ETS is currently developing an item pool for the Texas 
Assessment of Basic Skills. These tests cover math and 
reading and are administered in grades 5 and 9. 

5. INSTITUTE FOR BEHAVIORAL RESEARCH AND CREATIVITY

Utah: IBRC advised on goal development and item validity for
sections of the Utah Statewide Assessment Battery,
which deal with emotional maturity, music, and art.

6. INSTRUCTIONAL OBJECTIVES EXCHANGE 

Virginia: IOE produced the portion of the Virginia Graduation
Competency Test which covers reading.

7. INTRAN

Louisiana: INTRAN assisted in item design for the portion of the
Louisiana Assessment Program that deals with reading.
The tests are currently administered in grades 4, 8, and
11. In 1982 this test will become a pass/fail test controlling
movement to higher grades, starting with grade 2.

8. MCGRAW-HILL (CTB)

District of Columbia: CTB helped develop the customized Prescriptive Reading
and Math Tests that are used in the District.

9. NATIONAL EVALUATION SYSTEMS

Connecticut: NES assisted in the development of the Connecticut
Assessment of Educational Progress.

Georgia: NES is assisting in the development of a kindergarten test
for spring 1980.

Hawaii: NES is assisting in the development of an item pool and
item design for a competency test program. The test will
be administered in the third grade in 1980-81 and will
eventually be given in the sixth, eighth, and tenth grades.

Maryland: NES is assisting in the development of competency tests
in mathematics, writing, citizenship, survival, and the
world of work.

Massachusetts: NES assisted in the development of an item pool and
selected a sample of communities for field testing of the
Massachusetts Assessment of Basic Skills.

New Jersey: NES is assisting in the development of new items for the
Minimum Basic Skills Test. This test is administered in
grades 3, 6, 9, and 11 in reading and mathematics.

Rhode Island: NES assisted in item development for the Rhode Island
Life Skills Test. The test is administered in the eleventh
grade. The University of Rhode Island Curriculum and
Research and Development Center holds the current
contract.

Virginia: NES is assisting in item development and field testing for
the Basic Learning Skills Test Program. The tests cover
reading and mathematics and are given in grades 1
through 3. In 1980 the tests will be extended to grades 4
through 6.

10. NATIONAL TESTING SERVICE

Delaware: NTS is assisting in the development of an item pool that
school districts may use in designing local competency
tests. Local school districts must test but do not have to
use the item pool.

Louisiana: NTS is assisting in item development for the writing and
mathematics portions of the Louisiana Assessment Program.
The tests are currently administered in grades 4, 8,
and 11. In 1982 this will become a pass-fail test controlling
movement to the higher grades, starting with grade 2.

11. NORTHWEST EVALUATION ASSOCIATION
(consortium of state and local school officials in Oregon and Washington)

Wisconsin: The Northwest Evaluation Association is assisting in the
development of an item pool that school districts may use
at their discretion.

12. NORTHWEST REGIONAL LABORATORY 

Alaska: NRL assisted in item development for the Alaska Statewide
Assessment.

Idaho: NRL assisted in item development for the Idaho Proficiency
Test.

Oregon: NRL assisted in the development of an item pool and
field testing for the Oregon Statewide Assessment.

13. RESEARCH MANAGEMENT CORPORATION
(part of UNCO in Washington, D.C.)

New Hampshire: RMC is currently assisting in item design for the New
Hampshire Statewide Assessment Program.

14. RESEARCH TRIANGLE

Illinois: RT assisted in the development of the Illinois Inventory
of Educational Progress. This test is used to provide
sample assessments in reading, mathematics, and
citizenship.

Maine: RT is assisting in the development of a test to replace the
Maine Assessment of Educational Progress.

15. SCHOLASTIC TESTING SERVICES

North Carolina: STS produced two out of three of the currently used
versions of the Minimal Competency Test.

Tennessee: STS assisted in the development of the Basic Skills Test.
This competency test in reading, language arts, spelling,
and mathematics is administered in the eighth grade.

Virginia: STS produced the mathematics component of the Virginia
Graduation Competency Test.

16. SCIENCE RESEARCH ASSOCIATES (IBM)

Missouri: SRA developed the customized sixth-grade tests in reading
and mathematics that are part of the Missouri Statewide
Testing Program. All tests in this program are
offered to school districts but are not required or used on
a voluntary sample basis.

17. TOUCHSTONE APPLIED SCIENCE ASSOCIATES

Connecticut: TASA produced the Degrees of Reading Power, a test of
reading proficiency or competence. In 1979-80 Connecticut
used the test in the ninth grade to identify students
who need remediation.

New York: TASA produced the Degrees of Reading Power and the
Degrees of Writing Power. Both of these tests are part of
New York's competency testing package. The Degrees of
Reading Power attempts to determine what someone can
read in the way of ordinary prose. It is currently given in
the sixth, ninth, eleventh, and twelfth grades. In January
1981, passing this test will be a requirement for graduation.
It will be administered three times a year to eleventh
and twelfth graders, so that a student is given six chances
to pass the test. The Degrees of Writing Power attempts
to determine how well students can write, compared with
predefined characteristics of good writing. The test is
teacher-scored. It was administered in 1978 to ninth-grade
students who were not in the Regents Program
(college-bound track). It will eventually be administered
on the same basis as the Degrees of Reading Power.

The Degrees of Reading Power was produced by TASA
under contract with the New York State Board of
Regents. Dr. Bertram Koslin, who once co-owned
TASA, stated that the contract involved federal funds
tapped by New York. However, the test is now jointly
owned by the New York State Board of Regents and the
College Entrance Examination Board, which is marketing the test.
The College Board plans to buy out the Regents' share
and become the sole owner.

Appendix C

NEA'S ANALYSIS OF THE WIRTZ REPORT ON DECLINING SAT SCORES



EXECUTIVE SUMMARY 

For more than one hundred years psychologists and educators have been
using tests to measure human abilities. The 1880's were Galton's decade in the
mental testing field, followed by Cattell (1890's) and Binet (1900's). Actually
tests and measurement as they affect our life today evolved from at least three
interrelated developments: (1) the study of individuals who deviated from the
norm, (2) the experimental study of normal adult behavior, and (3) the development
of mathematical models as tools for measurement.

More recently, the use of mental tests for sorting and selecting students by
colleges and universities has become the work of the Educational Testing Service
(ETS), which is a private, nonprofit organization devoted to measurement and
research primarily in the field of education. It was founded in 1947 by the
American Council on Education, the Carnegie Foundation for the Advancement
of Teaching, and the College Entrance Examination Board (CEEB).

Since 1972 ETS has had a budget of over $47 million, with a 1976 budget of
$62.9 million. Testing activities amounted to $55.8 million of the revenue; the
balance came from research, development, instructional services, and other sources.
Actually, $2.9 million (4.6 percent) of ETS's revenue came from the federal government.
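
The percentages in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 1976 budget figure of $62.9 million can be treated as total revenue, which the text implies but does not state outright, and the variable and function names are illustrative.

```python
# Figures in millions of dollars, as quoted in the text for 1976.
budget_1976 = 62.9       # total budget, treated here as total revenue (assumption)
testing_revenue = 55.8   # revenue from testing activities
federal_revenue = 2.9    # revenue from the federal government

def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

testing_share = share(testing_revenue, budget_1976)
federal_share = share(federal_revenue, budget_1976)
non_testing_balance = budget_1976 - testing_revenue

print(f"Testing share of revenue: {testing_share:.1f} percent")
print(f"Federal share of revenue: {federal_share:.1f} percent")
print(f"Balance from non-testing activities: ${non_testing_balance:.1f} million")
```

Under that assumption the federal share works out to the 4.6 percent quoted in the text, and testing accounts for close to nine-tenths of revenue.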



Objectives

The five-member NEA Special Committee on the (Wirtz) Report on Declining
SAT Scores reviewed the charge from President Ryor and developed three
objectives and nine related questions with which to analyze the CEEB-ETS report,
as follows:

OBJECTIVE ONE: To analyze On Further Examination, the College Entrance
Examination Board's report of the advisory panel on the Scholastic Aptitude Test
score decline.

Question No. 1: What were the highlights of the CEEB-ETS report on the
declining SAT scores?

Question No. 2: What were the significant findings of the CEEB-ETS study
about the SAT score decline?

Question No. 3: Is there any evidence in the CEEB-ETS report that the SAT
continue to be used by institutions of higher education as a standard
to select students for admission?

Question No. 4: What were the implications of the CEEB-ETS report for
classroom instruction?

OBJECTIVE TWO: To review NEA's current policy on standardized testing, considering
the following: CEEB's On Further Examination, the attempt on the part
of selected members of the U.S. Congress to pass federal legislation on testing, the
related impact on local school district curricula and teacher evaluation.

Question No. 5: What is NEA's current policy on standardized tests?

Question No. 6: Should NEA change its policy on standardized tests?

Question No. 7: What should NEA's position be on the "back to basics"
controversy and on the attempts being made to reduce curricula offerings
at the local level?

OBJECTIVE THREE: To develop a set of recommendations for presentation to
the various levels of NEA governance and to alert standing committees of policy
recommendations.

Question No. 8: In which areas are policy recommendations needed on
testing?

Question No. 9: In which areas are recommendations needed to improve
NEA program activities in the field of standardized testing?



CONCLUSIONS 

Three objectives and nine questions were used by NEA's Special Committee
on Declining SAT Scores to analyze the College Entrance Examination Board's
report On Further Examination and make recommendations. The objectives and a
brief statement of the Committee's conclusion about each are presented
in this section.

OBJECTIVE ONE: To analyze On Further Examination, the College
Entrance Examination Board's report of the advisory panel on the Scholastic
Aptitude Test score decline.

An analysis of the five sections and related studies included in the Wirtz
report produced a mixed reaction about the findings. A substantial amount of
evidence was presented in the form of descriptive statistics, which suggested that
the study of the decline of the Scholastic Aptitude Test scores was not possible.

A comparison of just the number of students completing high school, entering
college, and taking the SAT suggests that the last 25 years have produced not
only more students to be educated but also a need for multiple criteria (standards),
not just one criterion that applies to all students throughout the country.
For the CEEB panel to have extended its study beyond the demographic data
presented raises a question about what was expected to be found in all the
isolated, univariate-type studies that were commissioned and that appear in the
appendixes to On Further Examination.

An analysis of the ETS auditors' report for 1975 and 1976 shows that
revenue from testing activities was approximately $49 million in 1975 and $56
million in 1976. The SAT produced an estimated $9.1 million in 1975 and $9.8
million in 1976. In both years this equaled about 18 percent of ETS's annual
revenue.

If only 7 percent of the 1976-77 high school graduates took the SAT, as was
the case in 1951-52, there would be an $8.3 million reduction in ETS's revenue.
Such a reduction of revenue would obviously have a significant impact on ETS's
activities and staffing. The SAT is one of the corporation's greatest sources of
income. For the Wirtz panel to have concluded anything about the SAT that
would have produced less use of the test was highly unlikely.
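
The revenue figures in the two preceding paragraphs can be worked through as a short calculation. The sketch below makes two assumptions that the text does not spell out: that the "about 18 percent" figure uses revenue from testing activities as its base, and that the hypothetical $8.3 million reduction would come entirely out of 1976 SAT revenue. Variable names are illustrative.

```python
# Figures in millions of dollars, as quoted in the text.
testing_revenue = {1975: 49.0, 1976: 56.0}   # approximate revenue from testing
sat_revenue = {1975: 9.1, 1976: 9.8}         # estimated revenue from the SAT

# SAT revenue as a percentage of testing revenue (assumed base) for each year.
sat_share = {
    year: 100 * sat_revenue[year] / testing_revenue[year]
    for year in testing_revenue
}

# Hypothetical case from the text: participation falls back to 7 percent.
hypothetical_reduction = 8.3
remaining_sat_revenue = sat_revenue[1976] - hypothetical_reduction
reduction_share = 100 * hypothetical_reduction / sat_revenue[1976]

for year in sorted(sat_share):
    print(f"{year}: SAT share of testing revenue = {sat_share[year]:.1f} percent")
print(f"SAT revenue left after the reduction: ${remaining_sat_revenue:.1f} million")
print(f"Reduction as a share of 1976 SAT revenue: {reduction_share:.1f} percent")
```

With testing revenue as the base, the two yearly shares fall on either side of the 18 percent cited, and under the second assumption the hypothetical reduction would remove most of the SAT's 1976 revenue.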

ETS is not the only corporation earning a substantial amount of money from
testing activities. Many book publishers profit from selling tests and books that
help produce good test results. For example, Harcourt, Brace, and Jovanovich
sells the Stanford Achievement Test and the OTIS Group Intelligence test;
Houghton Mifflin markets the Stanford-Binet, Lorge-Thorndike, and Iowa Tests
of Basic Skills; McGraw-Hill owns the California Test Bureau, which sells the
California Achievement Test battery; and International Business Machines owns
Science Research Associates.

In the Committee's opinion the problem is the unwillingness of the
testing industry to apply contemporary technology to improve the state of the
art in testing.

The Wirtz report more than adequately answers the questions about the
reliability of the SAT and attempts to answer the question about predictive
validity.

The more significant validity questions about construct validity (the underlying
theoretical basis of what is actually being measured by the instrument,
combined with supportive statistical and logical data from research studies) and
content validity (which relates to the content currently being taught in the
schools) were not adequately investigated or at least not reported.

To use the concept of an "unchanging standard" and to begin to investigate
the changes in schools and society for 25, 20, or even five years does not suggest
that the most objective approach was used to evaluate the decline in the SAT
scores.

The Wirtz report provides a two-premise explanation about the 14-year SAT
score decline.

The first premise portrays the decline for the first six or seven years as being 
caused by a markedly changing SAT-taking population. During this interval (1963 
to 1969) there were "larger proportions of characteristically lower-scoring groups
of students."

The second premise attributes the decline in the last seven years to "factors 
in the schools and in society at large." The changing nature of societal values
caused the schools to attempt to provide a more diversified curriculum for the
various groups of students who had not previously had the opportunity or need,
in terms of employment, to reach high school or beyond.

The CEEB panel had to resort to "circumstantial evidence" to explain the score
decline between 1970 and 1977. In Part Four of the report more than 50
theories were examined and discussed by the panel. Each of these theories held
three assumptions in common: one, "that since the problem has been reduced to
a single statistic -- the drop in these averages -- there must be a single answer;
second, that what has happened is in every respect bad; and third, that whatever
caused it is somebody else's fault."

The panel's "only certain conclusion is that we are dealing here with a
virtually seamless web of causal connections. [The] most critical elements emerge
more clearly in looking first at some developments in the schools, then at several
major societal changes, and finally at the murky but probably vital area of youths'
motivations."

Twenty-seven published appendixes were reported along with the findings of
the CEEB SAT score decline. There was extensive use of descriptive statistics and
studies with nonrandom sampling and selection techniques. Nonrandom sampling
restricts the panel's ability to make generalizations about students in all 50 states
and the 16,000-plus school districts. Instead, the panel was forced to make
decisions based on isolated studies and what it termed "circumstantial evidence."
Specifically, the conclusions relating to television were termed "essentially
subjective."

The report had an overriding tone throughout about "traditional" standards
and values, which were challenged (parenthetically) by limiting statements in the
report. However, the statements of consensus provide only subjective,
generalized conclusions about the SAT score decline. In fact, the two types of score
decline within the arbitrary 14-year interval of 1963-77 were attributed to
"changing membership of the population tested" and "six other sets of
developments."

The six other sets of developments were determined to have a beginning in
1971. It was acknowledged that the "forces" began before 1971; however, the
effects can only be attributed to the 1971-77 interval.

Why there are six sets of developments rather than one, two, three, or nine is
not adequately addressed in the report. Frequent reference is made to the
dynamics of change in society and the historical consistency (reliability) of the
SAT without any reference or question about the validity of the SAT as a
surrogate for society and its unchanging standards.

It appears that the CEEB-ETS report could have been written by any panel
charged with developing circumstantial evidence about the decline of the SAT
scores over the past 40, 30, or 25 years. The studies that were used as a basis to
reach conclusions do not provide a scientific data base on which to make an
objective evaluation about the alleged decline in SAT scores.

NEA's Special Committee raised the question about the continued use of the
SAT for selecting students for college admission. An analysis of continued use
produced the following conclusion.

The SAT is considered to be a maximum performance test. It was designed
to predict success in college. Tests of mastery of school subjects are called
achievement tests. The SAT is an aptitude test and not an achievement test. The
difference between achievement and aptitude tests is in the way in which they are
used.

A test is generally referred to as an achievement test when it is used to
determine a person's success in "past" study. The same test when used to forecast
future success in a course or assignment is generally referred to as an aptitude test.
The way the test is used determines the classification of the test. Generally,
when a test, such as the SAT, is used to predict future academic performance
(prognosis), it is classified as an aptitude test. When a test is used for diagnosis, it is
referred to as an achievement test.

Teachers use diagnostic (achievement) tests in their day-to-day teaching of
students. The use of a test to analyze a student's performance on a set of tasks to
improve learning is an appropriate use of tests.

The use of a test to reject or select an individual for college or employment
tends to foster racism, elitism, classism, and separatism. The CEEB-ETS report
provides many examples of how the SAT does discriminate against students who
belong to the lower socioeconomic groups, minorities, and women. It is precisely
for these types of reasons that the NEA is searching for a different means to
measure achievement and to do away with aptitude tests.

The predictive validity of the SAT does not compare favorably to the grades
given by a teacher as a predictor of future success in college. The CEEB-ETS
reported findings of the studies on the predictive validity of a student's scores on
the SAT and a student's high school grades were consistent with previous studies.
Bloom and Peters reported a validity coefficient between high school and college
grades of .5 dating as far back as 1926. The point to be made is that for more
than 50 years, high school grades have been the best predictor of college grades.
The use of the SAT adds very little to a college's ability to predict future success.
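
As an illustration only, and not part of the Committee's analysis, a validity
coefficient can be read as a share of variance accounted for by squaring it. The
short Python sketch below applies that standard relationship to the .5
coefficient quoted above; the function name is hypothetical.

```python
def variance_explained(r: float) -> float:
    """Return the share of variance accounted for by a correlation r (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Coefficient between high school and college grades quoted in the text.
hs_grades_r = 0.5
share = variance_explained(hs_grades_r)
print(f"r = {hs_grades_r}: {share:.0%} of the variance in college grades")
```

On this reading, a coefficient of .5 corresponds to roughly one quarter of the
variance in college grades, so any second predictor can only be judged by how
much it adds beyond that share.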



It would seem more rational to use both an achievement test and grades to
determine a student's current ability to perform. At least this approach would
help in the diagnosis and future development of the student.

In summary, it appears that the SAT cannot adequately predict a student's
success in college. Furthermore, the changing needs of society, families, students,
and teachers conflict with the "unchanging standard" that the makers of the SAT
profess to have built into the test.

NEA's Special Committee reviewed other published articles and references
about the SAT, including an article by Ralph W. Tyler, a member of the Wirtz
panel, in which both he and Benjamin S. Bloom, another panel member, provide
their own explanations about the score decline. Tyler commented about
children's achievement and Bloom about the score decline. Their conclusions are
as follows.

Tyler:

The available data regarding the educational achievements of our children are
not wholly consistent with the trend in Scholastic Aptitude Test scores. The
National Assessment of Educational Progress, for example, furnishes information
on the educational achievements of a reliable sample of nine-year-olds,
thirteen-year-olds, seventeen-year-olds, and young adults, ages twenty-six to
thirty-five. In a survey taken first in 1971, and again in 1975, National Assessment
found that, nationwide, an estimated fifty thousand more nine-year-olds were able
to respond correctly to a typical reading item in 1975 than in 1971. The reading
performance of seventeen-year-olds has also improved somewhat during the past
four years. On the other hand, reading achievements of thirteen-year-olds has
changed little during this period.

In mathematics, National Assessment found that ninety per cent of
seventeen-year-olds can add, subtract, multiply, and divide accurately with whole
numbers, but only forty-five per cent can use these computational skills properly in
working out unit costs, the amount of income tax due, and other quantitative problems
often encountered by adults. In science, in 1969-1970 when the mass media was
emphasizing the importance of science, thirteen-year-olds and seventeen-year-olds
performed five per cent better than four years later when science was given less
favorable treatment in the press. In writing (composition), the average score of
thirteen-year-olds and seventeen-year-olds has declined consistently.

The Scholastic Aptitude Test data show that the decline has been greater in
the verbal sections than in the mathematics ones, and has been greater in the
sections testing vocabulary than in reading.

Bloom: 

I think there is a lot wrong with American education, but the Scholastic
Aptitude Test is not where you are going to identify it. The S.A.T. comparative
figures are based on the 1941 version of S.A.T., when approximately forty-one
thousand students, most of them going to Ivy League colleges, took the test.
Today, about two million young people are going to colleges, mostly public; about
a million and a half take the S.A.T. test.

The first major drop in S.A.T. scores took place between 1941 and 1951. By
1951, about half a million students were taking the tests, and many of them were
heading for institutions other than Dartmouth or Swarthmore or Yale.

From 1951 to 1977, the drop in the verbal score has been about fifty points
and the drop in the mathematics score about thirty points. About twenty-eight of
the fifty points in the verbal-score decline and twenty-three of the thirty points in
the mathematics score are attributable to the change in the composition of the
college population during that period. In 1951, more than half of the students who
took the S.A.T. were in the upper twenty per cent of their high-school class. Today,
about a third of the students taking the S.A.T. are in the upper twenty per cent of
their class. The compositional change does not refer to blacks or Chicanos. It refers
for the most part to white children coming from different sectors of their
high-school graduating class.

I should point out that the rest of the drop in the S.A.T. scores (that is, that
which we cannot clearly account for) concerns three test items out of
approximately ninety in the verbal test and two items out of approximately
seventy-five or eighty in the mathematical test.

It is also important to note that the S.A.T. is a speed test. Almost no student
can finish the test in the allotted time. If you were to let each student have as much
time as he wanted, the distribution would be very different. At one time, students
would prepare for several weeks (some of them would prepare for a year) getting
ready for the S.A.T., developing speed in answering questions and solving problems.
There is very little of that kind of preparation now. Also, students used to repeat
the S.A.T. and increase their score by about thirty points. Today, students take the
S.A.T., and whatever score they get, they let it stand. The number of students
retaking the S.A.T. between their junior and senior year in high school has
decreased enormously. In the minds of students, the importance of the S.A.T. as the
major gatekeeper in American education has dropped significantly.
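
The arithmetic behind Bloom's figures can be checked with a few lines of
Python. This is a sketch for the reader's convenience, using only the point
values quoted above; the variable and function names are illustrative and do
not come from the article.

```python
# Figures quoted by Bloom for 1951-1977, in points on the S.A.T. scale.
declines = {
    "verbal": {"total_drop": 50, "due_to_composition": 28},
    "mathematics": {"total_drop": 30, "due_to_composition": 23},
}


def unexplained(total_drop: int, due_to_composition: int) -> int:
    """Points of decline left over after the compositional change is removed."""
    return total_drop - due_to_composition


for section, d in declines.items():
    rest = unexplained(d["total_drop"], d["due_to_composition"])
    print(f"{section}: {rest} of {d['total_drop']} points not accounted for")
```

The remainder is 22 points on the verbal score and 7 points on the mathematics
score, which is the portion Bloom equates with a handful of test items.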


In addition, Tyler identifies a number of implications for the classroom and
for society. There is a need for more writing assignments, the critical use of
television as a supplementary resource in the learning process, and the
examination of the out-of-school educational environment.

Finally, it is reassuring to know that there is no evidence to support the view
that children are learning less today. There is a need to determine what and when
society wants students to learn what is deemed valuable and important. If there is
a need for writing assignments, there will have to be accommodations both within
the school curricula and in the out-of-school experiences.

OBJECTIVE TWO: To review NEA's current policy on standardized testing,
considering the following: CEEB's On Further Examination, the attempt on
the part of selected members of the U.S. Congress to pass federal legislation
on testing, the related impact on local school district curricula, and teacher
evaluation.

The second objective was principally directed toward the review and
examination of current NEA policy on standardized tests. After studying the
report of NEA's Task Force on Testing, the Committee concluded that there was
a need to rewrite the current resolution.

The proposed resolution appears in the recommendations. 

OBJECTIVE THREE: To develop a set of recommendations for presentation
to the various levels of NEA governance and to alert standing committees of 
policy recommendations. 

The Committee developed three policy and five program recommendations.
The recommendations appear on pages 53-55.

Appendix D

NEA 1980 RESOLUTIONS CONCERNING TESTING, 
CRITERION-REFERENCED TESTS, AND TRUTH-IN-TESTING



H-10. Testing 




The National Education Association recognizes that testing of students, preschool
through job entry, may be appropriate for such purposes as --

a. Identifying learning needs

b. Recommending instructional activities

c. Describing student progress.

The Association opposes the use of tests that deny students full access to equal
educational opportunities, or that are used to evaluate teachers.

The Association believes that standardized tests should not be administered when
they are --

a. Potentially damaging to a student's self-concept 

b. Biased 

c. Used as the only criterion for student placement 

d. Invalid, unreliable, or out-of-date

e. Used as a basis for the allocation of federal, state, or local funds 

f. Used by testing companies or publishers to promote their own financial
interests at the expense of sound educational uses

g. Used to compare individual schools 

h. Used in an exploitive manner by the media

i. Used as the sole criterion for graduation or promotion

j. Inappropriate for the use intended.



Revised resolution.

H-11. Criterion-Referenced Tests

The National Education Association believes that criterion-referenced tests are a
viable alternative to standardized norm-referenced tests. Such tests should be
designed to describe student performance based on carefully developed curriculum.
It is inappropriate to administer criterion-referenced tests that do not specifically
measure instructional content.

Staff, time, instructional materials, and other resources should be provided to assist
students who experience difficulty achieving the desired criteria reflected by tests.



New resolution.

H-12. Truth-in-Testing

The National Education Association believes that intelligence, aptitude, and
achievement tests have historically been used to differentiate rather than to measure
performance and have, therefore, prevented equal educational opportunities for all
students, particularly minorities, lower socioeconomic groups, and women.
Contemporary research on the structure of the intellect identifies multiple and varied
mental operations and advances the significant premise that these operations can be
taught, that intelligence is dynamic rather than fixed.



The Association further believes that the truth-in-testing movement is an important
step for bringing about long-needed test reform. Therefore, it urges all state
affiliates to strive for passage of truth-in-testing legislation that includes a provision
for each individual test taker to receive a copy of all test questions, scores, and
rationale for correct answers.

New resolution.

Appendix E

NEA'S ANALYSIS OF H.R. 3564 AND H.R. 4949

Two legislative proposals concerning educational testing are before the Committee
on Education and Labor. The first proposal, referred to as "Truth-in-Testing Act of
1979" (H.R. 3564), was introduced by Rep. Gibbons. The second proposal, the
"Educational Testing Act of 1979" (H.R. 4949), was introduced by Rep. Weiss. The
latter proposal (H.R. 4949) is based on New York legislation proposed and passed
during the summer of 1979.

H.R. 3564 and H.R. 4949 concern the use of standardized tests, a subject about
which NEA has raised questions and expressed concerns. Because of the NEA
concern with the use of standardized tests, both proposals have been analyzed in
terms of their similarities, their differences, and their responsiveness to NEA
concerns.

In general, NEA believes that the two proposals represent somewhat different
approaches to the use of standardized tests. To the extent that H.R. 3564 and
H.R. 4949 are responsive to NEA concerns, both proposals should be supported.
The Gibbons "Truth-in-Testing Act" (H.R. 3564), however, is expected to generate
more opposition in Congress and could, if passed, prove to be a less successful
vehicle for meeting the concerns expressed by NEA.

Both H.R. 3564 and H.R. 4949 represent notice and disclosure legislation. They
differ substantially as to the type of tests covered, the extent of involvement of the
Commissioner of Education, and the type of enforcement provisions. H.R. 3564
would cover the National Teacher Examination, which is a concern of NEA. The bill
would also cover other occupational tests, which will engender opposition, and tests
other than standardized tests, regulation of which would probably prove
unworkable. For the most part, the disclosure requirements of H.R. 3564 require the
type of information currently provided voluntarily by testing agencies such as ETS.
Because H.R. 3564 does not require disclosure of underlying data on the
examinations, it would not enable professionals outside the testing industry, including
teachers, to analyze or comment on test construction and validity. In addition,
H.R. 3564 fails to provide for disclosure of scoring data in addition to test scores
which may be given to educational institutions. Groups favoring testing disclosure
laws have stated that testing agencies provide information such as suspicions of
cheating, unacknowledged repetition of a test, and factors based on current school
attended to be used in evaluating the score. Students have not been informed of this
type of information where it is incorrect. In addition, students have not been
provided with their test answers and the correct answers. Thus, students have been
unable to learn of or correct computer grading errors. The Gibbons bill does not
address this problem either.


In contrast, each one of the instances noted above is addressed in the Weiss bill with
the exception of occupational testing. Various portions of the Weiss bill could use
clearer and better language. In addition, some consideration should be given to the
viability of including financial regulation of the testing companies in this
legislation.

In addition to standardized tests, H.R. 3564 covers "oral" tests, "practical" tests, and
"demonstration" examinations. Sec. 2(3). The bill apparently reaches practical or
demonstration examinations used in occupational licensing such as barbering, oral
examinations such as the foreign service examinations, and practical or
demonstration examinations used in educational admissions such as submission of a
portfolio to an art school or a stage performance required for a drama school.
Regulation of such tests would probably be unworkable.

H.R. 3564 contains both pre-test (Sec. 6(a)) and post-test (Sec. 6(b)) disclosure
requirements which require information to be provided to test takers. Prior to
administration of the test, each applicant must be provided with a written notice
containing essentially the types of information currently provided voluntarily by
the testing companies:

1. A detailed description of the area of knowledge or the type of aptitude
that the test attempts to analyze;

2. In the case of a test of knowledge, a detailed description of the subjects to
be tested;

3. The margin of error or the extent of reliability of the test, determined on
the basis of experimental uses of the test and, where available, actual
usage;

4. The manner in which the test results will be distributed by the testing
entity to the applicant and to other persons; and

5. A statement of the applicant's [post-test notification] rights.

The post-test notification provision requires that "promptly upon completion of
scoring" the test taker must be notified of:

1. The individual's specific performance in each of the subject or aptitude
areas tested;

2. How that specific performance ranked in relation to the other individuals
and how the individual ranked on total test performance;

3. The score required to pass the test for admission to such occupation or
the score which is generally required for admission to institutions of
higher education;

4. Any further information which may be obtained by the individual on
request.

Section 6(c), the final substantive provision of the bill, prohibits the scoring of
achievement tests on the basis of a curve:

c. No educational or occupational admissions test which tests
knowledge or achievement (rather than aptitude) shall be graded (for
purposes of determining the score required to pass the test for
admission) on the basis of the relative distribution of scores of other
test subjects.
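
The distinction Section 6(c) draws can be sketched in Python. The example is
hypothetical: the scores, the cut score, the number of places, and both function
names are assumptions made for illustration and appear nowhere in the bill. It
contrasts a fixed passing standard with a pass mark set by the relative
distribution of other test subjects' scores.

```python
def passes_fixed_standard(score: float, cut_score: float) -> bool:
    """Pass depends only on the candidate's own score."""
    return score >= cut_score


def passes_on_curve(score: float, all_scores: list[float], places: int) -> bool:
    """Pass depends on where the score falls among all candidates' scores."""
    if not 1 <= places <= len(all_scores):
        raise ValueError("places must be between 1 and the number of scores")
    ranked = sorted(all_scores, reverse=True)
    return score >= ranked[places - 1]  # score of the last candidate admitted


scores = [62, 70, 75, 81, 90]  # hypothetical group of test subjects
candidate = 75

print(passes_fixed_standard(candidate, cut_score=70))   # own score against a set mark
print(passes_on_curve(candidate, scores, places=2))     # rank among the group
```

In this made-up group the same score of 75 meets a fixed standard of 70 but
falls short when only the two highest scores pass, which is the kind of outcome
the provision is aimed at for knowledge and achievement tests.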

The enforcement provisions of H.R. 3564 (Sec. 7) authorize private causes of action
by an aggrieved individual "whenever any person has administered or there are
reasonable grounds to believe that any person is about to administer any test in
violation of this act." The bill specifically provides for "preventive relief" including
a permanent or temporary injunction and restraining orders and for appointment
of counsel "in such circumstances as the court may deem just." The bill authorizes
attorney's fees (Sec. 7(b)) and provides for federal court proceedings without regard
to exhaustion of remedies. Sec. 7(c).

The enforcement procedures of injunction or restraining order represent onerous
remedies, and it seems doubtful that federal courts will be inclined to enjoin the
administration of standardized tests such as the SAT. For this reason, the remedies
provided by the bill appear to be ineffective. Since the bill specifically authorizes "a
civil action for preventive relief," courts may find that such relief is the exclusive
remedy for violations of the Act.

The "Educational Testing Act of 1979" (H.R. 4949) identifies three legislative
purposes (Sec. 2(b)):

1. To ensure that test subjects and persons who use test results are fully
aware of the characteristics, uses, and limitations of standardized tests in
postsecondary education admissions;

2. To make available to the public appropriate information regarding the
procedures, development, and administration of standardized tests; and

3. To protect the public interest by promoting more dependable knowledge
about the limits of appropriate usage of standardized test results and by
promoting greater accuracy, validity, and reliability in the development,
administration, and interpretation of standardized tests.

This bill requires more extensive pre-test disclosure to test takers than H.R. 3564
and, unlike H.R. 3564, specifically requires that the pre-test notice be provided
contemporaneously with the test registration form. Sec. 3(a). The legislation
specifically addresses the coachability issue and requires testing agencies to inform
individuals of the extent to which their scores may be improved by taking a
preparation course. Pre-test notice must include the following information:

1. The purposes for which the test is constructed and is intended to be used.

2. The subject matters included on such test and the knowledge and skills
which the test purports to measure.

3. Statements designed to provide information for interpreting the test
results, including explanations of the test, and the correlation between
test scores and future success in schools and, in the case of tests used for
postbaccalaureate admissions, the correlation between test scores and
success in the career for which admission is sought.

4. Statements concerning the effects on and uses of test scores, including --

a. if the test score is used by itself or with other information to predict
future grade point average, the extent, expressed as a percentage, to
which the use of this test score improves the accuracy of predicting
future grade point average, over and above all other information
used; and

b. a comparison of the average score and percentiles of test subjects by
major income groups; and

c. the extent, if available to the test agency, to which test preparation
courses improve test subjects' scores on average, expressed as a
percentage.

5. A description of the form in which test scores will be reported, whether
the raw test scores will be altered in any way before being reported to the
test subject, and the manner, if any, in which the test agency will use the
test score (in raw or transformed form) by itself or together with any
other information about the test subject to predict in any way the
subject's future academic performance for any postsecondary
educational institution.

6. A complete description of any promises or covenants that the test agency
makes to the test subject with regard to accuracy of scoring, timely
forwarding or score reporting, and privacy of information (including test
scores and other information)*, relating to the test subjects.

7. The property rights of the test subject to the test results, if any, the
duration for which such results will be retained by the test agency, and
policies regarding storage, disposal, and future use of test scores.

8. The date by which the test subject's test scores will be completed and
[...] to the test subject.

9. A description of special services to accommodate physically
handicapped test subjects.

In addition to providing notice to test subjects, the bill requires the testing agency to
provide the same information to the recipient institution prior to or coincident with
the reporting of test scores.




The major area covered by the Weiss bill is reporting to governmental educational
agencies. Two types of information must be disclosed to the government. First, this
reporting requirement concerns the studies and evaluations of the tests themselves
and is designed to allow professionals outside the testing industry, including
teachers, access to such studies to allow independent analysis of the construction,
validity, and use of the tests. The second type of information to be disclosed includes
the test questions and answers and scoring rules. This is accomplished by
cross-reference to the Freedom of Information Act, 5 U.S.C. Sec. 552(a)(3), which
authorizes release of records. The test agency is required to provide to the
Commissioner of Education:

Any study, evaluation, or statistical report pertaining to a test, which a test
agency prepares or causes to be prepared, or for which it provides data.
Nothing in this paragraph shall require submission of any reports or
documents containing information identifiable with any individual test subject.
Such information shall be deleted or obliterated prior to submission to the
Commissioner, [and]

1. shall, within 30 days after the results of any standardized test are
released, file or cause to be filed in the office of the Commissioner --

a. a copy of all test questions used in calculating the test subject's
raw score;

b. the corresponding acceptable answers to those questions; and

c. all rules for transferring raw scores into those scores reported
to the test subject and postsecondary educational institutions
together with an explanation of such rules.

This data, in addition to being made available pursuant to the Freedom of
Information Act, must be made available by the Commissioner of Education to state
educational agencies and commissions.

The testing agency must also provide the questions, the correct answers, and the test
taker's answers, as well as scoring information, to the test subject on request for a
90-day period subsequent to release of the test scores.

Furthermore, the legislation requires the Commissioner of Education to prepare
for Congress an evaluation of the data on these tests both with regard to
coachability and cultural bias:

[...] the effective date of this Act concerning the relationship between the
test scores of test subjects and income, race, sex, ethnic, and
handicapped status. Such report shall include an evaluation of available
data concerning the relationship between test scores and the
completion of test preparation courses.



The major difference between the Weiss draft and the New York law upon which it
is based is an attempt in the federal legislation to regulate the costs to test subjects of
the tests and to require financial disclosures by the testing companies. During the
New York hearings, the testing companies argued that costs would skyrocket.
Proponents of the New York legislation, New York Public Interest Research Group
and Nader in particular, questioned these predictions using whatever data they could
obtain from the testing companies, especially ETS. Section 7 of the bill, entitled
"Testing Costs and Fees to Students," provides as follows:

In order to ensure that tests are being offered at a reasonable cost to test
subjects, each test agency shall report the following information to the
Commissioner:

1. Before March 31, 1981, or within 90 days after it first becomes a test
agency, whichever is later, the test agency shall report the closing date of
its testing year. Each test agency shall report any change in the closing
date of its testing year within 90 days after the change is made.

2. For each test program, within 120 days after the close of the testing year
the test agency shall report:

a. the total number of times the test was taken during the testing year;

b. the number of test subjects who have taken the test once, who have
taken it twice, and who have taken it more than twice during the
testing year;

c. the number of refunds given to individuals who have registered for,
but did not take, the test;

d. the number of test subjects for whom the test fee was waived or
reduced;

e. the total amount of fees received from the test subjects by the test
agency for each test program for that test year;

f. the total amount of revenue received from each test program; and

g. the expenses to the test agency of the tests, including:

(1) expenses incurred by the test agency for each test program;

(2) expenses incurred for test development by the test agency for
each test program; and

(3) all expenses which are fixed or can be regarded as overhead
expenses and not associated with any test program or with test
development.

3. If a separate fee is charged test subjects for admissions data assembly
services or score reporting services, within 120 days after the close of the
testing year, the test agency shall report:

a. the number of individuals registering for each admissions data
assembly service during the testing year;

b. the number of individuals registering for each score reporting
service during the testing year;

c. the total amount of revenue received from the individuals by the test
agency for each admissions data assembly service or score reporting
service during the testing year; and

d. the expenses to the test agency for each admissions data assembly 
service or score reporting service during the testing year. 


The Weiss bill, like the New York legislation, uses a civil penalty as its remedy.
While the New York law establishes a $500 penalty per violation, the federal bill
establishes a $2,000 fine. This would represent a small penalty where the test agency
failed to report properly to the Commissioner of Education, since this would probably
constitute a single violation. With regard to violations of the notice to students, the
penalties could be substantial, since presumably failure to provide the required
notices to students would result in multiple violations reflecting the number of
students involved. One potential difficulty in enforcement may be in determining
which and how many individuals were not given proper notice or timely reporting.
The Commissioner is authorized by the draft to promulgate regulations to implement
the legislation, and enforcement would be one area where regulations might fill
in the sketch created by the draft legislation.
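
The contrast between a single reporting violation and multiple notice violations can be illustrated with simple arithmetic. The following is a minimal sketch only: the per-violation amounts come from the text above, while the function name and the student count are hypothetical, and the assumption that each student who did not receive notice counts as one violation is the presumption described above, not a settled reading of either measure.

```python
# Per-violation civil penalties, in dollars, as stated in the text above.
NEW_YORK_PENALTY = 500
WEISS_BILL_PENALTY = 2_000


def penalty_exposure(per_violation: int, violations: int) -> int:
    """Total civil penalty if every violation is fined the same amount."""
    if violations < 0:
        raise ValueError("violations must be non-negative")
    return per_violation * violations


# A failure to report to the Commissioner would probably be one violation.
reporting_failure = penalty_exposure(WEISS_BILL_PENALTY, 1)

# Hypothetical figure: assume 10,000 students did not receive the required
# notice and that each such student counts as a separate violation.
hypothetical_students = 10_000
notice_failure = penalty_exposure(WEISS_BILL_PENALTY, hypothetical_students)

print(reporting_failure)  # 2000
print(notice_failure)     # 20000000
```

Under these assumptions the exposure scales linearly with the number of students affected, which is why counting the individuals who were not given proper notice becomes the central enforcement question.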

The Weiss bill would require disclosure to students of covenants and promises made
by the testing agencies. Private causes of action by test takers could be based on
breaches of these contractual warranties.



FOOTNOTE 

* 

The phrase "and other information" was added by Weiss's staff subsequent to 
conversations with NEA. Significant questions exist as to the use made by ETS of 
personal data obtained on the test or test application. ETS sells student lists to
institutions. 
Appendix F

EXECUTIVE OFFICE 



NATIONAL EDUCATION ASSOCIATION • 1201 16th St., N.W. • Washington, DC 20036 • (202) 833-4000

JOHN RYOR, President    TERRY HERNDON, Executive Director

WILLARD H. McGUIRE, Vice-President

JOHN T. McGARIGAL, Secretary-Treasurer

June 8, 1979
Dr. Warren G. Hill
Executive Director
Education Commission of the States
1860 Lincoln Street
Denver, Colorado 80203

Dear Dr. Hill:

The National Education Association strongly supports the
Education Commission of the States' application to continue
as the organization responsible for the National Assessment
of Educational Progress. The National Assessment has gained
respect from teachers, administrators, and educational policy
makers at all levels of the education community over the last
fifteen years.

NEA advocates measurement techniques and approaches which 
help policy makers formulate intelligent decisions about 
school programs. The National Assessment has provided this 
information in the past and it is hoped that the program can
be extended down into the local school districts to replace 
the current fad of competency testing. 

NEA strongly supports the makeup of the National Assessment
Policy Committee which includes teacher representation on
the committee. The Association would urge that the Federal
Government continue this practice and require that teachers
be represented on the National Assessment Policy Committee
in direct proportion to their national membership. The four
teachers on the committee should be designated by the majority
organization or, if this is not possible, allocated to teacher
organizations according to membership. Administrator and
school board organizations should designate their representatives.

The NEA recommends that the ECS be granted the funds to continue
the NAEP.



Sincerely, 




Terry Herndon
Executive Director